CN113724055A - Commodity attribute mining method and device - Google Patents


Info

Publication number
CN113724055A
Authority
CN
China
Prior art keywords
commodity
attribute
sentence
short sentence
short
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111076600.2A
Other languages
Chinese (zh)
Other versions
CN113724055B (en
Inventor
陈东东
章钦
郭雪茹
周源
易津锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Application filed by Jingdong Technology Information Technology Co Ltd
Priority to CN202111076600.2A
Publication of CN113724055A
Application granted
Publication of CN113724055B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0623 Item investigation
    • G06Q30/0625 Directed, with specific intent or strategy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Artificial Intelligence (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Development Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a commodity attribute mining method and device. The method comprises: determining a mapping relation between a commodity attribute number and the short sentences describing commodity attributes; inputting each short sentence into a preset vectorization model to obtain a sentence vector, output by the vectorization model, that represents the semantics of the short sentence, where the vectorization model is obtained by training a pre-training network model on a set of short sentences labeled with attribute data for the target commodity category; and determining the commodity attribute features corresponding to the commodity attribute number based on the mapping relation and the commodity public attribute features corresponding to the clusters of the sentence vectors. The method can effectively analyze unstructured data, fully mine the hidden attribute features of commodities, and improve the efficiency of commodity attribute mining.

Description

Commodity attribute mining method and device
Technical Field
The disclosure relates to the technical field of big data analysis, and in particular to a commodity attribute mining method and device. It also relates to an electronic device and a processor-readable storage medium.
Background
For each item sold on an online platform, some unstructured information is available in addition to the structured information in the database that describes the item's characteristics; the most important unstructured information is the text in the pictures on the item detail page. Generally, each commodity attribute number corresponds to several commodity detail page pictures, each picture contains several advertising or explanatory sentences, and each sentence describes a feature of one dimension of the commodity. How to mine the commodity attribute feature expressed by each sentence from this correspondence, and then infer the hidden attribute features of each commodity attribute number in support of tasks such as commodity recommendation, commodity customization, and selling-point mining, has become an urgent problem.
Disclosure of Invention
Therefore, the invention provides a commodity attribute mining method and device to overcome the defects of the manual and computer-assisted mining schemes in the prior art, which are highly limited, time-consuming and labor-intensive, prone to generating redundant attributes, and dependent on considerable manual intervention, resulting in poor efficiency and stability of commodity attribute mining.
The present disclosure provides a method for mining commodity attributes, including:
determining a mapping relation between the commodity attribute number and a short sentence describing the commodity attribute; wherein the short sentence is in the commodity detail page picture;
inputting the short sentence into a preset vectorization model to obtain a sentence vector which is output by the vectorization model and represents the semantics of the short sentence;
the vectorization model is obtained by training a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model;
and determining the commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the cluster of the sentence vectors and the mapping relation.
Further, determining the commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the cluster of the sentence vectors and the mapping relationship, specifically including:
clustering the sentence vectors to obtain corresponding clustering clusters;
extracting commodity attribute features of the short sentence commonalities in the clustering cluster, and taking the commodity attribute features of the commonalities as the commodity public attribute features corresponding to the clustering cluster;
and matching the commodity public attribute characteristics corresponding to the cluster with the mapping relation to obtain the commodity attribute characteristics of the commodity attribute numbers.
Further, the clustering the sentence vectors to obtain corresponding clustering clusters specifically includes: based on the semantics of the sentence corresponding to the sentence vector, clustering the sentence vector by using a K-means clustering model to obtain a clustering cluster containing a plurality of sentence vectors; and the semantics of the short sentences corresponding to the sentence vectors in the clustering cluster meet a preset semantic approximate condition.
Further, the determining the mapping relationship between the commodity attribute number and the short sentence describing the commodity attribute specifically includes:
mapping the commodity detail page picture to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture;
recognizing the text information of the commodity detail page picture by using an optical character recognition mode to obtain a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute;
and determining the mapping relation according to the initial mapping relation and a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute.
Further, the product attribute mining method further includes:
based on a preset triplet loss function, gathering in the vector space the sentence vectors corresponding to similar short sentences, pushing apart in the vector space the sentence vectors corresponding to non-similar short sentences, and adding pseudo labels, according to spatial distance, to sample sentences among the short sentences that are not labeled with attribute data, so as to determine a training sample set of the pre-training network model;
training the pre-training network model based on the training sample set.
Further, based on the short sentence set marked with the attribute data, the pre-training network model is trained in a transfer learning mode to obtain the vectorization model.
Further, the product attribute number is stock management information of the product, and the stock management information refers to a numeric code or an alphabetic code for uniquely identifying the product.
The present disclosure also provides a commodity attribute excavating apparatus, including:
the mapping relation determining unit is used for determining the mapping relation between the commodity attribute number and the short sentence describing the commodity attribute; wherein the short sentence is in the commodity detail page picture;
the vectorization processing unit is used for inputting the short sentence into a preset vectorization model to obtain a sentence vector which is output by the vectorization model and represents the semantic meaning of the short sentence;
the vectorization model is obtained by training a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model;
and the commodity attribute characteristic determining unit is used for determining the commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the clustering clusters of the sentence vectors and the mapping relation.
Further, the commodity attribute feature determining unit is specifically configured to:
clustering the sentence vectors to obtain corresponding clustering clusters;
extracting commodity attribute features of the short sentence commonalities in the clustering cluster, and taking the commodity attribute features of the commonalities as the commodity public attribute features corresponding to the clustering cluster;
and matching the commodity public attribute characteristics corresponding to the cluster with the mapping relation to obtain the commodity attribute characteristics of the commodity attribute numbers.
Further, the clustering the sentence vectors to obtain corresponding clustering clusters specifically includes: based on the semantics of the sentence corresponding to the sentence vector, clustering the sentence vector by using a K-means clustering model to obtain a clustering cluster containing a plurality of sentence vectors; and the semantics of the short sentences corresponding to the sentence vectors in the clustering cluster meet a preset semantic approximate condition.
Further, the mapping relationship determining unit is specifically configured to:
mapping the commodity detail page picture to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture;
recognizing the text information of the commodity detail page picture by using an optical character recognition mode to obtain a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute;
and determining the mapping relation according to the initial mapping relation and a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute.
Further, the product attribute mining device further includes:
a sample set determining unit, configured to gather in the vector space, based on a preset triplet loss function, the sentence vectors corresponding to similar short sentences, to push apart in the vector space the sentence vectors corresponding to non-similar short sentences, and to add pseudo labels, according to spatial distance, to sample sentences among the short sentences that are not labeled with attribute data, so as to determine a training sample set of the pre-training network model;
and the model training unit is used for training the pre-training network model based on the training sample set.
Further, based on the short sentence set marked with the attribute data, the pre-training network model is trained in a transfer learning mode to obtain the vectorization model.
Further, the product attribute number is stock management information of the product, and the stock management information refers to a numeric code or an alphabetic code for uniquely identifying the product.
The present disclosure also provides an electronic device, comprising: a memory, a processor and a computer program stored on the memory and operable on the processor, the processor implementing the steps of the method for mining the attributes of a good as described in any one of the above.
The present disclosure also provides a processor-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for mining the attributes of goods as described in any one of the above.
The commodity attribute mining method provided by the disclosure obtains a sentence vector representing the semantics of a short sentence by determining the mapping relation between a commodity attribute number and the short sentence used for describing the commodity attribute in a commodity detail page picture and inputting the short sentence into a preset vectorization model; clustering the sentence vectors, and determining commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the clustering clusters of the sentence vectors and the mapping relation; the vectorization model is obtained by training with a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model. The method can effectively analyze the unstructured data, fully excavate the attribute features hidden by the commodity, and improve the efficiency of excavating the commodity attributes.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for mining commodity attributes according to an embodiment of the present disclosure;
Fig. 2 is a complete flow diagram of a method for mining commodity attributes according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of a product attribute mining device according to an embodiment of the present disclosure;
Fig. 4 is a schematic physical structure diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The following describes an embodiment of the product attribute mining method in detail based on the present disclosure. As shown in fig. 1, which is a schematic flow chart of a method for mining an attribute of a commodity according to an embodiment of the present disclosure, a specific implementation process includes the following steps:
Step 101: Determine the mapping relation between the commodity attribute number and the short sentences that describe commodity attributes in the commodity detail page picture.
Specifically, the commodity attribute number is a Stock Keeping Unit (SKU), that is, stock keeping management information: a numeric or alphabetic code assigned to a product to uniquely identify it, so that an enterprise can manage stock more easily and efficiently. The code is typically between 8 and 12 characters long and is located on the price label of the item. The commodity detail page picture is a picture on the online sales platform that contains detailed commodity information, and this information is unstructured. For each item, some unstructured information is available in addition to the structured information in the database that describes the item's characteristics, and the most important of it is the information in the commodity detail page picture. Each commodity attribute number corresponds to at least one commodity detail page picture, each picture contains several advertising or explanatory sentences, and each sentence either describes a commodity attribute feature of one dimension of the commodity or is a meaningless sentence. The short sentences are these advertising or explanatory sentences, and each commodity detail page picture corresponds to at least one short sentence describing a commodity attribute.
In this step, the commodity detail page picture is first mapped to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture. The text in the commodity detail page picture is then recognized to obtain the short sentences that correspond to the picture and describe commodity attributes. Finally, the mapping relation among the commodity attribute number, the commodity detail page picture, and the short sentences is determined from the initial mapping relation and these short sentences.
It should be noted that, in a practical implementation of the present invention, the text of the commodity detail page picture may be recognized by means including, but not limited to, Optical Character Recognition (OCR). Specifically, when OCR is used, each commodity detail page picture can be mapped to its commodity attribute number through database operations, and the text on the picture is read with OCR. Generally, one commodity attribute number corresponds to several commodity detail page pictures, and one picture corresponds to several OCR short sentences, that is, short sentences describing commodity attributes. After the text is read, it can be cleaned: short sentences whose fonts do not meet preset conditions are filtered out, leaving short sentences that describe commodity attributes and can be fed directly into the vectorization model. This finally yields the mapping relation among the commodity attribute numbers, the commodity detail page pictures, and the short sentences.
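Step 101 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`clean_phrases`, `build_mapping`, `fake_ocr`) are hypothetical, the OCR engine is replaced by a stub that returns canned text, and the cleaning rule filters on phrase length and duplicates, whereas the description filters on font conditions that plain text cannot reproduce.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def clean_phrases(raw: Iterable[str], min_len: int = 2, max_len: int = 30) -> List[str]:
    """Drop empty, too-short or too-long OCR fragments and exact duplicates.

    Assumption: a length filter stands in for the font-based filter.
    """
    seen, kept = set(), []
    for text in raw:
        phrase = " ".join(text.split())  # normalise whitespace
        if min_len <= len(phrase) <= max_len and phrase not in seen:
            seen.add(phrase)
            kept.append(phrase)
    return kept


def build_mapping(
    sku_pictures: Iterable[Tuple[str, str]],
    ocr: Callable[[str], List[str]],
) -> Dict[str, Dict[str, List[str]]]:
    """Return {sku: {picture: [phrase, ...]}} from (sku, picture) pairs."""
    mapping: Dict[str, Dict[str, List[str]]] = defaultdict(dict)
    for sku, picture in sku_pictures:
        mapping[sku][picture] = clean_phrases(ocr(picture))
    return dict(mapping)


# Hypothetical stand-in for a real OCR engine: picture id -> recognised lines.
FAKE_OCR_OUTPUT = {
    "pic_1.jpg": ["60 hours of battery life", "  ", "60 hours of battery life"],
    "pic_2.jpg": ["High-speed sonic vibration", "Available in pink and blue"],
}


def fake_ocr(picture: str) -> List[str]:
    return FAKE_OCR_OUTPUT.get(picture, [])


mapping = build_mapping(
    [("MS12446103", "pic_1.jpg"), ("MS12446103", "pic_2.jpg")], fake_ocr
)
```

Keeping the picture level in the mapping preserves the initial relation between commodity attribute number and picture, so a phrase can always be traced back to the picture it was read from.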
Step 102: Input the short sentence into a preset vectorization model to obtain a sentence vector that is output by the vectorization model and represents the semantics of the short sentence. The vectorization model is obtained by training a pre-training network model on a set of short sentences labeled with attribute data for the target commodity category.
In the embodiment of the invention, before this step is executed, a Bert model is fine-tuned in advance with manually labeled commodity attribute data, yielding a vectorization model that can effectively distinguish short-sentence semantics. The sentence vectors output by the vectorization model are then clustered, and the clustering result gives several clusters of short sentences that describe different commodity attributes, so that hidden commodity attribute information can be mined from new, unlabeled data.
It should be noted that the Bert model is a pre-trained model. Suppose a training set A exists: the initial network model is first pre-trained on training set A, and the network parameters learned on that task are saved for later use. When a new task B arrives, the same network structure can be adopted and the parameters learned on training set A loaded at initialization; the network model is then trained with the training data of task B. If the loaded parameters are kept unchanged during this training, they are said to be "frozen"; if they continue to be adjusted as task B is trained, the process is called "fine-tuning", that is, the network parameters are adjusted to better suit the current task B. In the embodiment of the present invention, training set A may refer to the set of short sentences labeled with attribute data for the target commodity category; training set B, corresponding to task B, may refer to a set of unlabeled sample sentences.
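The difference between frozen and fine-tuned parameters can be seen on a toy model. The sketch below is not the Bert setup of the embodiment; it assumes a hypothetical two-parameter model, prediction = head * (encoder * x), in which `encoder` plays the role of the loaded pre-trained parameters and `head` is the task-specific part. All names and numbers are illustrative.

```python
from typing import Dict, List, Tuple


def train(
    params: Dict[str, float],
    data: List[Tuple[float, float]],
    freeze_encoder: bool,
    lr: float = 0.005,
    epochs: int = 300,
) -> Dict[str, float]:
    """Fit y ~ head * (encoder * x) by gradient descent on squared error."""
    p = dict(params)  # copy, so the loaded "pre-trained" values are not mutated
    for _ in range(epochs):
        for x, y in data:
            hidden = p["encoder"] * x
            error = p["head"] * hidden - y
            grad_head = 2 * error * hidden
            grad_encoder = 2 * error * p["head"] * x
            p["head"] -= lr * grad_head
            if not freeze_encoder:  # fine-tuning: the loaded parameter keeps learning
                p["encoder"] -= lr * grad_encoder
    return p


pretrained = {"encoder": 1.5, "head": 1.0}      # pretend these were learned on task A
task_b = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]   # task B: y = 3x

frozen = train(pretrained, task_b, freeze_encoder=True)
fine_tuned = train(pretrained, task_b, freeze_encoder=False)
```

In the frozen run only `head` moves and `encoder` stays at its loaded value; in the fine-tuned run both parameters move, yet both runs end up fitting task B.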
As shown in fig. 2, in a specific implementation, the Bert model serves as an NLP embedding model: inputting a short sentence yields a corresponding output sentence vector, in which the class token ([CLS]) output at the first position can represent the category information of the whole short sentence, and short sentences with similar semantics lie closer together in the vector space. The invention can adopt the bert-as-service model or a Bert model such as Chinese-BERT-wwm; these models have already been trained on Chinese data sets at the scale of billions of characters and can therefore capture the semantic information in the short sentences well.
In this step, after the short sentence is input into the preset vectorization model to obtain the sentence vector that represents its semantics, a preset triplet loss function can be used to gather the sentence vectors of similar short sentences in the vector space and push apart the sentence vectors of non-similar short sentences. For sample sentences among the short sentences that are not labeled with attribute data, pseudo labels are added according to spatial distance to determine a training sample set of the pre-training network model, and the pre-training network model is trained on this training sample set.
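Pseudo-labeling by spatial distance can be sketched as follows. The description states that the nearest sample is treated as a positive and the farthest as a negative; the rest is assumed for illustration: the candidate pool is taken to be the labeled short sentences, the distance is Euclidean, and the two-dimensional vectors and the function name `pseudo_triplets` are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pseudo_triplets(
    unlabeled: Dict[str, Vector], pool: Dict[str, Vector]
) -> List[Tuple[str, str, str]]:
    """For each unlabeled phrase, return (anchor, pseudo positive, pseudo negative):
    the nearest pool phrase is the positive, the farthest is the negative."""
    triplets = []
    for anchor, vec in unlabeled.items():
        ranked = sorted(
            (name for name in pool if name != anchor),
            key=lambda name: euclidean(vec, pool[name]),
        )
        if len(ranked) >= 2:  # need two distinct candidates
            triplets.append((anchor, ranked[0], ranked[-1]))
    return triplets


# Toy 2-D "sentence vectors"; real ones would come from the vectorization model.
labeled_pool = {
    "60 hours of battery life": (0.9, 0.1),
    "high-speed sonic vibration": (0.1, 0.9),
}
unlabeled = {"battery life improved by 50%": (0.8, 0.2)}

triplets = pseudo_triplets(unlabeled, labeled_pool)
```

Each resulting triplet can be appended to the training sample set, which is how the set grows beyond the manually labeled short sentences.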
It should be noted that, in practice, because the input short sentences come from the text of a specific commodity category, specialized terms and domain-specific expressions may appear and degrade the performance of the Bert model. Therefore, in the embodiment of the present invention, the short sentence set that has been partially labeled by hand with attribute information (that is, the short sentence set labeled with attribute data for the target commodity category) is used for transfer learning on top of the class token ([CLS]) vector output by the Bert model, so that the final output vector better fits the domain of the target commodity category.
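The description mentions that a two-layer fully connected network can serve as the transfer learning layer that maps the Bert class-token vector to the final sentence vector. The forward pass of such a head might look like the sketch below. The layer sizes, the ReLU activation, the random initialization, and the class name `TransferHead` are all assumptions, and training (for example with triplet loss) is omitted.

```python
import random
from typing import List

Matrix = List[List[float]]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Matrix:
    """Small random weights, one row per output unit."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def linear(weights: Matrix, bias: List[float], x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


class TransferHead:
    """Two fully connected layers: class-token vector in, sentence vector out."""

    def __init__(self, dim_in: int, dim_hidden: int, dim_out: int, seed: int = 0):
        rng = random.Random(seed)
        self.w1 = init_layer(dim_in, dim_hidden, rng)
        self.b1 = [0.0] * dim_hidden
        self.w2 = init_layer(dim_hidden, dim_out, rng)
        self.b2 = [0.0] * dim_out

    def __call__(self, cls_vector: List[float]) -> List[float]:
        hidden = relu(linear(self.w1, self.b1, cls_vector))
        return linear(self.w2, self.b2, hidden)


head = TransferHead(dim_in=8, dim_hidden=4, dim_out=3)
cls_vector = [0.1 * i for i in range(8)]  # stand-in for a real class-token vector
sentence_vector = head(cls_vector)
```

A real class-token vector is much wider than the 8 dimensions used here; the small sizes only keep the example readable.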
Because the vectorization model outputs sentence vectors that represent short-sentence semantics, and category labels exist for the partially hand-labeled short sentence set, a triplet loss function (triplet loss) can be selected as the loss function in a specific implementation. The system is expected not only to recover the manually labeled categories but also to learn automatically new categories that were never labeled. The principle of triplet loss is as follows: each time, a triplet (a, b, c) is selected in which a and b belong to the same category and a and c belong to different categories; the loss is the distance between a and b, minus the distance between a and c, plus a margin value. Through repeated back propagation, the distance between a and b becomes smaller than the distance between a and c by at least the margin. As a result, the vectors of similar short sentences gather in the space and the vectors of non-similar short sentences move apart. The labeled data is usually a relatively small, manually annotated set, so for each unlabeled sample the nearest and farthest samples in the embedding space are computed and used as pseudo labels for triplet loss training. Specifically, the training set of the fine-tuned Bert model is gradually enlarged by treating the nearest sample as a positive and the farthest sample as a negative. In addition, during transfer learning, a transfer learning layer consisting of a two-layer fully connected network can achieve good results: it takes the class token ([CLS]) vector output by the Bert model as input and outputs the final embedding vector, that is, the sentence vector.
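Written out, the triplet loss for one triplet (a, b, c) is commonly given as L = max(d(a, b) - d(a, c) + margin, 0). The clamp at zero is the usual formulation and is an assumption here, since the description only gives the expression inside the max; Euclidean distance and the margin value are likewise illustrative choices. A minimal Python version:

```python
import math
from typing import Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 1.0) -> float:
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    return max(
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0
    )


a = (0.0, 0.0)
b = (0.0, 1.0)      # same category as a, at distance 1
c_far = (0.0, 3.0)  # different category, at distance 3
c_near = (0.0, 1.5) # different category, at distance 1.5

satisfied = triplet_loss(a, b, c_far)   # 1 - 3 + 1 < 0, so the loss is 0
violated = triplet_loss(a, b, c_near)   # 1 - 1.5 + 1 = 0.5
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that still violate the margin contribute gradients.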
Step 103: Determine the commodity attribute features corresponding to the commodity attribute number based on the mapping relation and the commodity public attribute features corresponding to the clusters of the sentence vectors.
In the embodiment of the present invention, determining the commodity attribute features corresponding to the commodity attribute number based on the mapping relation and the commodity public attribute features corresponding to the clusters of the sentence vectors includes: clustering the sentence vectors to obtain the corresponding clusters; extracting the commodity attribute features common to the short sentences in each cluster and taking these common features as the commodity public attribute features of that cluster; and matching the commodity public attribute features of the clusters against the mapping relation to obtain the commodity attribute features of the commodity attribute number. Because each short sentence corresponds to the sentence vector obtained by vectorizing it, the correspondence between the clusters of sentence vectors and the short sentences is known, and matching this correspondence against the mapping relation determines the commodity attribute features. The commodity public attribute feature of a cluster refers to the commodity attribute feature shared by the several short sentences in that cluster.
The sentence vectors are clustered to obtain corresponding clustering clusters, and the specific implementation process comprises the following steps: based on the semantics of the sentence corresponding to the sentence vector, clustering the sentence vector to obtain a cluster containing a plurality of sentence vectors; and the semantics of the short sentences corresponding to the sentence vectors in the clustering cluster meet a preset semantic approximate condition.
It should be noted that, in a specific implementation of the present invention, the sentence vectors may be clustered with methods including, but not limited to, a K-means clustering model, that is, the K-means clustering algorithm; no limitation is imposed here.
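For reference, a minimal K-means (Lloyd's algorithm) over sentence vectors can be written with the standard library alone. This is a sketch rather than the clustering model of the embodiment: the random initialization from the data points, the seed, the iteration cap, and the toy two-dimensional vectors are assumptions.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(points: List[Vector]) -> Tuple[float, ...]:
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(
    points: List[Vector], k: int, max_iter: int = 100, seed: int = 0
) -> Tuple[List[int], List[Tuple[float, ...]]]:
    """Return (cluster label per point, centroids) using Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so the run has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[j] = centroid(members)
    return labels, centroids


# Toy sentence vectors: two about battery life, two about sonic vibration.
sentence_vectors = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels, centroids = kmeans(sentence_vectors, k=2)
```

The cluster indices themselves are arbitrary; only the grouping of the vectors carries meaning, and the number of clusters k corresponds to the hyper-parameter n in the description.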
As shown in fig. 2, the embedding results obtained through transfer learning on the Bert model are clustered with a preset clustering algorithm to obtain several clusters, and the short sentences within each cluster are semantically similar. Owing to the transfer learning in step 102, the vectorization model not only recognizes the general semantics of Chinese but is also more sensitive to the vocabulary of specific commodities. Therefore, by choosing a preset hyper-parameter, namely the number of clusters n, the short sentences can be grouped into the corresponding clusters. Finally, by analyzing the short sentences in each cluster, the common commodity attribute features can be extracted and used as the commodity public attribute features of that cluster; the commodity attribute features corresponding to the commodity attribute numbers (for example, sku1 and sku2) are then determined based on these commodity public attribute features and the mapping relation.
In a specific implementation, the commodity attribute mining method provided by the present disclosure can be applied to attribute mining tasks for commodities such as electric toothbrushes and electric shavers; the type of commodity to which it is applied is not specifically limited. Take an electric toothbrush as an example, and suppose its commodity attribute number is MS12446103. The commodity detail pages of the electric toothbrush are collected, and the short sentences describing the attributes of the electric toothbrush in the commodity detail page pictures are identified by optical character recognition. These short sentences include, but are not limited to: an endurance time of 60 hours; endurance time improved by 50%; available in pink and blue; vibration frequency reaching 31,000 times per minute through sound wave technology; high-speed sound wave vibration of about 40,000 times per minute; and high-efficiency sound wave vibration. After the mapping relation between the commodity attribute number and the short sentences is determined, the short sentences are input into a vectorization model fine-tuned in advance on labeled electric toothbrush attribute data, and the corresponding sentence vectors are output.
The sentence vectors are then clustered. For example, the sentence vectors of the short sentences that express the vibration frequency of the electric toothbrush (vibration frequency reaching 31,000 times per minute through sound wave technology, high-speed sound wave vibration of about 40,000 times per minute, high-efficiency sound wave vibration, and the like) are gathered into one cluster, and the commodity public attribute feature extracted for that cluster is "high-frequency sound waves"; according to the constructed mapping relation and the commodity attribute number MS12446103 of the electric toothbrush, the commodity attribute feature corresponding to that commodity attribute number is determined to be "high-frequency sound waves". Likewise, the sentence vectors of the short sentences that express the endurance time of the electric toothbrush (an endurance time of 60 hours, endurance time improved by 50%, and the like) are gathered into another cluster, and the commodity public attribute feature extracted for that cluster is "ultra-long endurance"; according to the constructed mapping relation and the commodity attribute number MS12446103, the commodity attribute feature corresponding to that commodity attribute number is determined to be "ultra-long endurance". In this example, the number of clusters n is 2.
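The electric toothbrush example can be traced with a short sketch. It assumes that clustering has already assigned each short sentence to a cluster and that a public attribute has already been extracted for each cluster; the dictionaries and the function `attributes_for_skus` are hypothetical names introduced only for illustration, and the cluster ids 0 and 1 are arbitrary.

```python
# Mapping relation: commodity attribute number -> short sentences from its detail page pictures.
sku_to_phrases = {
    "MS12446103": [
        "high-efficiency sound wave vibration",
        "high-speed sound wave vibration",
        "60 hours of endurance time",
        "50% improvement of endurance time",
    ],
}

# Assumed clustering output with n = 2: short sentence -> cluster id.
phrase_to_cluster = {
    "high-efficiency sound wave vibration": 0,
    "high-speed sound wave vibration": 0,
    "60 hours of endurance time": 1,
    "50% improvement of endurance time": 1,
}

# Commodity public attribute feature extracted for each cluster.
cluster_to_attribute = {
    0: "high-frequency sound waves",
    1: "ultra-long endurance",
}


def attributes_for_skus(sku_to_phrases, phrase_to_cluster, cluster_to_attribute):
    """Propagate each cluster's public attribute back to the SKUs via the mapping."""
    result = {}
    for sku, phrases in sku_to_phrases.items():
        attributes = []
        for phrase in phrases:
            attribute = cluster_to_attribute[phrase_to_cluster[phrase]]
            if attribute not in attributes:  # keep each attribute once, in first-seen order
                attributes.append(attribute)
        result[sku] = attributes
    return result


sku_attributes = attributes_for_skus(
    sku_to_phrases, phrase_to_cluster, cluster_to_attribute
)
```

Under these assumptions, the single commodity attribute number ends up with both mined attribute features, matching the outcome described in the example.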
With the commodity attribute mining method of the embodiments of the present disclosure, the mapping relation between a commodity attribute number and the short sentences describing commodity attributes in the commodity detail page pictures is determined, and the short sentences are input into a preset vectorization model to obtain the sentence vectors, output by the vectorization model, that represent the semantics of the short sentences. The sentence vectors are clustered, and the commodity attribute features corresponding to the commodity attribute number are determined based on the commodity public attribute features of the clusters of the sentence vectors and the mapping relation. The vectorization model is obtained by transfer-learning training of a pre-training network model on a set of short sentences labeled with attribute data for the target commodity category. Because the vectorization model is built through transfer learning, the labeled short sentence information can be fully used, so the model is better suited to the current commodity category; unstructured data is analyzed effectively, the hidden attribute features of commodities are fully mined, and the efficiency of commodity attribute mining is improved.
Corresponding to the commodity attribute mining method described above, the present disclosure also provides a commodity attribute mining device. Since the device embodiment is similar to the method embodiment, it is described only briefly; for relevant details, refer to the description of the method embodiment. The device embodiment described below is merely illustrative. Fig. 3 is a schematic structural diagram of a commodity attribute mining device according to an embodiment of the present disclosure.
The commodity attribute mining device specifically comprises the following parts:
a mapping relation determining unit 301, configured to determine a mapping relation between the product attribute number and a short sentence in the product detail page picture for describing the product attribute.
Specifically, the commodity attribute number, i.e., the stock management information, is a numeric or alphabetic code assigned to a commodity to uniquely identify it, so that an enterprise can manage inventory more easily and efficiently. A commodity detail page picture is a picture on an online sales counter that contains detailed information about the commodity; this information is unstructured. For each commodity, in addition to the structured information in the database that describes its characteristics, some unstructured information is available, the most important of which is the information in the commodity detail page pictures. Each commodity attribute number corresponds to at least one commodity detail page picture, each commodity detail page picture contains several advertisement sentences or explanatory sentences, and each sentence describes the commodity attribute features of one dimension of the commodity. The short sentences are these advertisement or explanatory sentences, and each commodity detail page picture corresponds to at least one short sentence describing a commodity attribute.
In a specific implementation process, the mapping relationship determining unit 301 first needs to map the product detail page picture to the product attribute number to obtain an initial mapping relationship between the product attribute number and the product detail page picture. Further, the mapping relationship determining unit 301 identifies the text information of the product detail page picture, obtains a short sentence corresponding to the product detail page picture and describing the product attribute, and determines the mapping relationship among the product attribute number, the product detail page picture and the short sentence describing the product attribute according to the initial mapping relationship and the short sentence corresponding to the product detail page picture and describing the product attribute.
A vectorization processing unit 302, configured to input the short sentence into a preset vectorization model, and obtain a sentence vector that represents the semantic meaning of the short sentence and is output by the vectorization model.
The vectorization model is obtained by training a pre-training network model on a set of short sentences labeled with attribute data for the target commodity category;
and a commodity attribute feature determining unit 303, configured to determine, based on the commodity public attribute feature corresponding to the cluster of the sentence vector and the mapping relationship, a commodity attribute feature corresponding to the commodity attribute number.
Further, the commodity attribute feature determining unit is specifically configured to:
clustering the sentence vectors to obtain corresponding clustering clusters;
extracting the commodity attribute features common to the short sentences in the cluster, and taking these common commodity attribute features as the commodity public attribute features corresponding to the cluster;
and matching the commodity public attribute characteristics corresponding to the cluster with the mapping relation to obtain the commodity attribute characteristics of the commodity attribute numbers.
Further, clustering the sentence vectors to obtain corresponding clusters specifically includes: based on the semantics of the short sentences corresponding to the sentence vectors, clustering the sentence vectors by using a K-means clustering model to obtain clusters each containing a plurality of sentence vectors, wherein the semantics of the short sentences corresponding to the sentence vectors in each cluster satisfy a preset semantic similarity condition.
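The text does not define the preset semantic condition that the members of a cluster must satisfy. One plausible reading, shown here purely as an assumption, is a minimum cosine similarity between every member vector and the cluster centroid. The function names and the threshold value 0.8 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_is_coherent(cluster_vectors, threshold=0.8):
    """True if every member is at least `threshold` cosine-similar to the centroid.

    The threshold is an assumed stand-in for the preset semantic condition.
    """
    centroid = [sum(col) / len(cluster_vectors) for col in zip(*cluster_vectors)]
    return all(
        cosine_similarity(v, centroid) >= threshold for v in cluster_vectors
    )


# Two nearly parallel vectors form a coherent cluster; two orthogonal ones do not.
tight = cluster_is_coherent([[1.0, 0.1], [0.9, 0.2]])
loose = cluster_is_coherent([[1.0, 0.0], [0.0, 1.0]])
```

A check of this kind could be run after K-means to flag clusters whose short sentences are too heterogeneous, for instance when n was chosen too small.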
Further, the mapping relationship determining unit is specifically configured to:
mapping the commodity detail page picture to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture;
recognizing the text information of the commodity detail page picture by using an optical character recognition mode to obtain a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute;
and determining the mapping relation according to the initial mapping relation and a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute.
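The three steps above amount to composing two mappings: commodity attribute number to detail page pictures, and picture to recognized short sentences. The sketch below assumes the OCR step is available as a callable; `fake_ocr` is a canned stand-in for a real optical character recognition engine, and the file names, `build_mapping`, and `phrases_for_sku` are hypothetical.

```python
def build_mapping(sku_to_pictures, recognize_text):
    """Combine the initial SKU -> pictures mapping with per-picture OCR results.

    `recognize_text` is any callable that takes a picture path and returns
    the list of short sentences found in that picture.
    """
    mapping = {}
    for sku, pictures in sku_to_pictures.items():
        mapping[sku] = {picture: recognize_text(picture) for picture in pictures}
    return mapping


def phrases_for_sku(mapping, sku):
    """Flatten SKU -> picture -> short sentences into SKU -> short sentences."""
    return [phrase for phrases in mapping[sku].values() for phrase in phrases]


# Canned results standing in for a real OCR engine (assumption for illustration).
_CANNED_OCR = {
    "toothbrush_page_1.png": ["60 hours of endurance time"],
    "toothbrush_page_2.png": ["high-efficiency sound wave vibration"],
}


def fake_ocr(picture_path):
    return list(_CANNED_OCR.get(picture_path, []))


initial_mapping = {"MS12446103": ["toothbrush_page_1.png", "toothbrush_page_2.png"]}
mapping = build_mapping(initial_mapping, fake_ocr)
```

Keeping the picture level in the mapping preserves the three-way relation among commodity attribute number, detail page picture, and short sentence, while `phrases_for_sku` gives the flattened view needed for vectorization.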
Further, the product attribute mining device further includes:
a sample set determining unit, configured to, based on a preset triplet loss function, gather the sentence vectors of similar short sentences among the short sentences together in the vector space and move the sentence vectors of dissimilar short sentences apart, and to add pseudo labels, according to spatial distance, to the sample sentences among the short sentences that are not labeled with attribute data, so as to determine a training sample set for the pre-training network model;
and the model training unit is used for training the pre-training network model based on the training sample set.
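The two ideas in the sample set determining unit can be illustrated with a small sketch. It uses the standard triplet loss form, max(0, d(anchor, positive) - d(anchor, negative) + margin), with Euclidean distance, and assigns a pseudo label by copying the label of the nearest labeled vector. The margin value, the nearest-neighbor rule, and the function names are assumptions; the text only states that pseudo labels are added according to spatial distance.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(
        0.0, math.dist(anchor, positive) - math.dist(anchor, negative) + margin
    )


def pseudo_label(vector, labeled_vectors):
    """Copy the label of the nearest labeled vector (assumed distance rule).

    `labeled_vectors` is a list of (vector, label) pairs.
    """
    _, nearest_label = min(
        labeled_vectors, key=lambda item: math.dist(vector, item[0])
    )
    return nearest_label


# Hypothetical 2-D sentence vectors with attribute labels.
labeled = [
    ([0.0, 0.0], "vibration frequency"),
    ([5.0, 5.0], "endurance time"),
]
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
guess = pseudo_label([0.2, 0.1], labeled)
```

Minimizing this loss over many triplets pulls similar short sentences together and pushes dissimilar ones apart, which is what makes a distance-based pseudo label meaningful for the unlabeled sample sentences.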
Further, the commodity attribute number corresponds to at least one commodity detail page picture, and the commodity detail page picture corresponds to at least one short sentence describing the commodity attribute.
Further, the product attribute number is stock management information of the product, and the stock management information refers to a numeric code or an alphabetic code for uniquely identifying the product.
With the commodity attribute mining device of the embodiments of the present disclosure, the mapping relation between a commodity attribute number and the short sentences describing commodity attributes in the commodity detail page pictures is determined, and the short sentences are input into a preset vectorization model to obtain the sentence vectors, output by the vectorization model, that represent the semantics of the short sentences. The sentence vectors are clustered, and the commodity attribute features corresponding to the commodity attribute number are determined based on the commodity public attribute features of the clusters of the sentence vectors and the mapping relation. The vectorization model is obtained by transfer-learning training of a pre-training network model on a set of short sentences labeled with attribute data for the target commodity category. Because the vectorization model is built through transfer learning, the labeled short sentence information can be fully used, so the model is better suited to the current commodity category; unstructured data is analyzed effectively, the hidden attribute features of commodities are fully mined, and the efficiency of commodity attribute mining is improved.
Corresponding to the commodity attribute mining method described above, the present disclosure further provides an electronic device. Since the electronic device embodiment is similar to the method embodiment, it is described only briefly; for relevant details, refer to the description of the method embodiment. The electronic device described below is merely illustrative. Fig. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present disclosure. The electronic device may include a processor 401, a memory 402, and a communication bus 403, wherein the processor 401 and the memory 402 communicate with each other through the communication bus 403 and communicate with the outside through a communication interface 404. The processor 401 may invoke logic instructions in the memory 402 to perform a commodity attribute mining method comprising: determining a mapping relation between a commodity attribute number and short sentences describing commodity attributes, wherein the short sentences are located in commodity detail page pictures; inputting the short sentences into a preset vectorization model to obtain the sentence vectors, output by the vectorization model, that represent the semantics of the short sentences, wherein the vectorization model is obtained by training a pre-training network model on a set of short sentences labeled with attribute data for the target commodity category; and determining the commodity attribute features corresponding to the commodity attribute number based on the commodity public attribute features of the clusters of the sentence vectors and the mapping relation.
Furthermore, the logic instructions in the memory 402 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a Memory chip, a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present disclosure also provides a computer program product, which includes a computer program stored on a processor-readable storage medium. The computer program includes program instructions, and when the program instructions are executed by a computer, the computer can execute the commodity attribute mining method provided by the above method embodiments. The method comprises the following steps: determining a mapping relation between a commodity attribute number and short sentences describing commodity attributes, wherein the short sentences are located in commodity detail page pictures; inputting the short sentences into a preset vectorization model to obtain the sentence vectors, output by the vectorization model, that represent the semantics of the short sentences, wherein the vectorization model is obtained by training a pre-training network model on a set of short sentences labeled with attribute data for the target commodity category; and determining the commodity attribute features corresponding to the commodity attribute number based on the commodity public attribute features of the clusters of the sentence vectors and the mapping relation.
In another aspect, the embodiments of the present disclosure also provide a processor-readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the commodity attribute mining method provided by each of the above embodiments. The method comprises the following steps: determining a mapping relation between a commodity attribute number and short sentences describing commodity attributes, wherein the short sentences are located in commodity detail page pictures; inputting the short sentences into a preset vectorization model to obtain the sentence vectors, output by the vectorization model, that represent the semantics of the short sentences, wherein the vectorization model is obtained by training a pre-training network model on a set of short sentences labeled with attribute data for the target commodity category; and determining the commodity attribute features corresponding to the commodity attribute number based on the commodity public attribute features of the clusters of the sentence vectors and the mapping relation.
The processor-readable storage medium can be any available medium or data storage device that can be accessed by a processor, including, but not limited to, magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present disclosure.

Claims (10)

1. A commodity attribute mining method is characterized by comprising the following steps:
determining a mapping relation between the commodity attribute number and a short sentence describing the commodity attribute; wherein the short sentence is in the commodity detail page picture;
inputting the short sentence into a preset vectorization model to obtain a sentence vector which is output by the vectorization model and represents the semantics of the short sentence;
the vectorization model is obtained by training a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model;
and determining the commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the cluster of the sentence vectors and the mapping relation.
2. The product attribute mining method according to claim 1, wherein determining the product attribute features corresponding to the product attribute numbers based on the product public attribute features corresponding to the cluster of the sentence vectors and the mapping relationship specifically comprises:
clustering the sentence vectors to obtain corresponding clustering clusters;
extracting commodity attribute features of the short sentence commonalities in the clustering cluster, and taking the commodity attribute features of the commonalities as the commodity public attribute features corresponding to the clustering cluster;
and matching the commodity public attribute characteristics corresponding to the cluster with the mapping relation to obtain the commodity attribute characteristics of the commodity attribute numbers.
3. The product attribute mining method according to claim 2, wherein the clustering the sentence vectors to obtain corresponding cluster specifically comprises:
based on the semantics of the sentence corresponding to the sentence vector, clustering the sentence vector by using a K-means clustering model to obtain a clustering cluster containing a plurality of sentence vectors; and the semantics of the short sentences corresponding to the sentence vectors in the clustering cluster meet a preset semantic approximate condition.
4. The product attribute mining method according to claim 1, wherein the determining of the mapping relationship between the product attribute number and the short sentence describing the product attribute specifically comprises:
mapping the commodity detail page picture to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture;
recognizing the text information of the commodity detail page picture by using an optical character recognition mode to obtain a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute;
and determining the mapping relation between the commodity attribute number and the short sentence according to the initial mapping relation and the short sentence which corresponds to the commodity detail page picture and describes the commodity attribute.
5. The product attribute mining method of claim 1, further comprising: based on a preset triplet loss function, gathering the sentence vectors of similar short sentences among the short sentences together in the vector space and moving the sentence vectors of dissimilar short sentences apart, and adding pseudo labels, according to spatial distance, to the sample sentences among the short sentences that are not labeled with attribute data, so as to determine a training sample set of the pre-training network model;
training the pre-training network model based on the training sample set.
6. The product attribute mining method of claim 1, further comprising: and training the pre-training network model by using a transfer learning mode based on the short sentence set marked with the attribute data to obtain the vectorization model.
7. The product attribute mining method according to any one of claims 1 to 6, wherein the product attribute number is stock management information of a product, and the stock management information refers to a numeric code or an alphabetic code for uniquely identifying the product.
8. A commodity attribute mining device, characterized by comprising:
the mapping relation determining unit is used for determining the mapping relation between the commodity attribute number and the short sentence describing the commodity attribute; wherein the short sentence is in the commodity detail page picture;
the vectorization processing unit is used for inputting the short sentence into a preset vectorization model to obtain a sentence vector which is output by the vectorization model and represents the semantic meaning of the short sentence;
the vectorization model is obtained by training a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model;
and the commodity attribute characteristic determining unit is used for determining the commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the clustering clusters of the sentence vectors and the mapping relation.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the method of mining the attributes of a commodity according to any one of claims 1 to 7.
10. A processor-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the commodity attribute mining method according to any one of claims 1 to 7.
CN202111076600.2A 2021-09-14 2021-09-14 Commodity attribute mining method and device Active CN113724055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111076600.2A CN113724055B (en) 2021-09-14 2021-09-14 Commodity attribute mining method and device

Publications (2)

Publication Number Publication Date
CN113724055A (en) 2021-11-30
CN113724055B (en) 2024-04-09

Family

ID=78683691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111076600.2A Active CN113724055B (en) 2021-09-14 2021-09-14 Commodity attribute mining method and device

Country Status (1)

Country Link
CN (1) CN113724055B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011227886A (en) * 2010-03-30 2011-11-10 Rakuten Inc Commodity information providing system, commodity information providing method, and program
CN106067132A (en) * 2016-05-27 2016-11-02 乐视控股(北京)有限公司 The method to set up of item property and device
CN106408321A (en) * 2015-07-31 2017-02-15 华为技术有限公司 Management method and device of commodity template, and method and device for calling database, and system
CN107203548A (en) * 2016-03-17 2017-09-26 阿里巴巴集团控股有限公司 Attribute acquisition methods and device
CN107679247A (en) * 2017-10-31 2018-02-09 南威软件股份有限公司 A kind of method that electric business website realizes self-defined maintenance items extension information
CN109670066A (en) * 2018-12-11 2019-04-23 江西师范大学 A kind of Freehandhand-drawing formula toggery image search method based on dual path Deep Semantics network
CN110347835A (en) * 2019-07-11 2019-10-18 招商局金融科技有限公司 Text Clustering Method, electronic device and storage medium
CN110490682A (en) * 2018-05-15 2019-11-22 北京京东尚科信息技术有限公司 The method and apparatus for analyzing item property
CN111401409A (en) * 2020-02-28 2020-07-10 创新奇智(青岛)科技有限公司 Commodity brand feature acquisition method, sales volume prediction method, device and electronic equipment
CN113065882A (en) * 2020-01-02 2021-07-02 阿里巴巴集团控股有限公司 Commodity processing method and device and electronic equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114169966A (en) * 2021-12-08 2022-03-11 海南港航控股有限公司 Method and system for extracting unit data of goods by tensor
CN114169966B (en) * 2021-12-08 2022-08-05 海南港航控股有限公司 Method and system for extracting unit data of goods by tensor

Also Published As

Publication number Publication date
CN113724055B (en) 2024-04-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant