CN115860007A

CN115860007A - Index influence degree calculation method and device, storage medium and electronic equipment

Info

Publication number: CN115860007A
Application number: CN202310111723.8A
Authority: CN
Inventors: 肖荣昌; 张家栋; 许先才; 熊磊
Original assignee: Shenzhen Yunintegral Technology Co ltd
Current assignee: Shenzhen Yunintegral Technology Co ltd
Priority date: 2023-02-14
Filing date: 2023-02-14
Publication date: 2023-03-28
Anticipated expiration: 2043-02-14
Also published as: CN115860007B

Abstract

The invention discloses a method and a device for calculating index influence degree, a storage medium and electronic equipment, wherein the method comprises the following steps: acquiring a commodity description picture of a target commodity; extracting an unordered text list from the commodity description picture, and adjusting the unordered text list into an ordered text list according to a text distance to obtain text semantic information of the commodity description picture; acquiring a term dictionary of the target commodity, and identifying term words in the text semantic information by adopting the term dictionary and a term extraction model; and acquiring the sales index data of the target commodity, and respectively calculating the influence degree of the plurality of term words on the sales index data. According to the invention, the readability of the commodity description pictures is improved, the junk texts in the commodity description pictures are reduced, the text quality of the commodity description pictures is improved, and the technical problem that the influence degree of the description terms of the commodities on the sales index data cannot be accurately calculated in the related technology is solved.

Description

Index influence degree calculation method and device, storage medium and electronic equipment

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for calculating index influence degree, a storage medium and electronic equipment.

Background

In e-commerce marketing copy in the related art, most industry or brand terms are concentrated in the picture detail pages of commodities. As for term extraction, most existing techniques extract terms in an unsupervised manner and then construct a term network from contextual semantic relationships; how to accurately extract terms of a specific category from pictures is rarely addressed.

In the related art, the descriptors in a picture detail page may be continuously adjusted and updated in order to improve sales indexes such as the sales volume of a commodity. However, the degree to which each term influences those indexes can only be estimated and compared manually and cannot be calculated accurately.

In view of the above problems in the related art, no effective solution has been found so far.

Disclosure of Invention

The embodiment of the invention provides a method and a device for calculating index influence degree, a storage medium and electronic equipment.

According to an aspect of the embodiments of the present application, there is provided a method for calculating an index influence degree, including: acquiring a commodity description picture of a target commodity; extracting an unordered text list from the commodity description picture, and adjusting the unordered text list into an ordered text list according to a text distance to obtain text semantic information of the commodity description picture; acquiring a term dictionary of the target commodity, and identifying term words in the text semantic information by adopting the term dictionary and a term extraction model; and acquiring sales index data of the target commodity, and respectively calculating influence degrees of the plurality of term words on the sales index data.

Further, acquiring the commodity description picture of the target commodity includes: acquiring a Uniform Resource Locator (URL) address of the commodity description picture of the target commodity; and calling a downloading tool to download the commodity description picture from the URL address.

Further, extracting an unordered text list from the commodity description picture, and adjusting the unordered text list into an ordered text list according to a text distance to obtain text semantic information of the commodity description picture comprises: extracting a plurality of text characters and corresponding text coordinates from the commodity description picture by adopting an Optical Character Recognition (OCR) tool; generating an unordered text list using the plurality of text characters and corresponding text coordinates, wherein the unordered text list includes a plurality of text characters whose language order is random, and each element of the unordered text list includes: a text character, the horizontal coordinate corresponding to the text character, and the vertical coordinate corresponding to the text character; and generating text semantic information of the commodity description picture by adopting the unordered text list, wherein the text semantic information comprises a plurality of text characters arranged in order according to scene semantics.

Further, generating the text semantic information of the commodity description picture by using the unordered text list comprises: extracting an initial text from the unordered text list according to the text coordinates, and determining the initial text as a target text; iteratively performing the following steps until the unordered text list is empty: calculating the center distance between the target text and each remaining text in the unordered text list; judging whether a first nearby text whose center distance is smaller than a preset threshold exists in a first adjacent direction; if the first nearby text exists, extracting it from the unordered text list, splicing it after the target text, and updating the current target text to the first nearby text; if no first nearby text exists in the first adjacent direction, judging whether a second nearby text whose center distance is smaller than the preset threshold exists in a second adjacent direction; if the second nearby text exists, extracting it from the unordered text list, splicing it after the target text, and updating the current target text to the second nearby text; if no second nearby text exists in the second adjacent direction, extracting the initial text in the first adjacent direction, determining it as a third nearby text of the target text, splicing the third nearby text after the target text, and updating the current target text to the third nearby text; and determining the spliced text sequence as the text semantic information of the commodity description picture.

Further, identifying term words in the text semantic information using the term dictionary and the term extraction model includes: traversing the text semantic information by using the keywords in the term dictionary to obtain a plurality of first target terms matched with corresponding keywords, wherein the term dictionary comprises the keywords of a plurality of terms; labeling each first target term as label information of the corresponding hit field in the text semantic information to obtain sample data; training a sequence-to-sequence model by adopting the sample data to obtain the term extraction model; extracting a plurality of second target terms from the text semantic information by adopting the term extraction model; and fusing the plurality of first target terms and the plurality of second target terms to obtain the term words in the text semantic information.
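The dictionary-matching, labeling, and fusion steps above might be sketched as follows. This is only a minimal Python illustration under stated assumptions, not the implementation of the invention: the mapping from keyword to category label, the `(start, end, label)` span format, and the function names are hypothetical, and the spans predicted by the trained sequence-to-sequence model are passed in as plain data instead of being produced by a real model.

```python
import re
from typing import Dict, List, Tuple

# (start offset, end offset, label); a hypothetical span format
Span = Tuple[int, int, str]


def dictionary_match(text: str, term_dict: Dict[str, str]) -> List[Span]:
    """Traverse the text with every dictionary keyword and label each hit field."""
    spans: List[Span] = []
    for keyword, label in term_dict.items():
        for match in re.finditer(re.escape(keyword), text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def fuse_terms(text: str, dict_spans: List[Span], model_spans: List[Span]) -> List[str]:
    """Fuse dictionary hits and model predictions into one de-duplicated term list."""
    terms = {text[start:end] for start, end, _ in dict_spans + model_spans}
    return sorted(terms)


text = "pure cotton shirt with breathable mesh"
term_dict = {"pure cotton": "MATERIAL", "mesh": "MATERIAL"}  # assumed dictionary
first_terms = dictionary_match(text, term_dict)  # could also serve as sample data
second_terms = [(23, 33, "FEATURE")]             # stand-in for model output
print(fuse_terms(text, first_terms, second_terms))
```

In this sketch the labeled dictionary hits double as training samples, which mirrors the idea that the dictionary bootstraps the model and the model then finds terms the dictionary does not contain.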

Further, acquiring the sales index data of the target commodity and respectively calculating the influence degrees of the plurality of term words on the sales index data includes: acquiring first sales data of a first target commodity and second sales data of a second target commodity, wherein the sales index data comprises the sales data, and the target commodity comprises the first target commodity and the second target commodity; determining a first term set of the first target commodity and a second term set of the second target commodity; assigning the first sales data to each first term in the first term set, and assigning the second sales data to each second term in the second term set; determining whether a same third term exists in both the first term set and the second term set; if the same third term exists in both sets, assigning the average of the first sales data and the second sales data to the third term; and ordering the first terms, the second terms, and the third term based on the assigned sales data to obtain a first influence degree sequence.
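As a worked illustration of the assignment and averaging described above, the following Python sketch builds a first influence degree sequence for two commodities. The function name, the example figures, the descending sort order, and the alphabetical tie-break are assumptions made for the example only.

```python
from typing import List, Set, Tuple


def first_influence_sequence(
    first_terms: Set[str], first_sales: float,
    second_terms: Set[str], second_sales: float,
) -> List[Tuple[str, float]]:
    """Assign sales data to terms, average shared terms, and rank the result."""
    scores = {term: float(first_sales) for term in first_terms}
    scores.update({term: float(second_sales) for term in second_terms})
    # A term present in both sets receives the mean of the two sales figures.
    for term in first_terms & second_terms:
        scores[term] = (first_sales + second_sales) / 2
    # Assumed ordering: highest assigned value first, ties broken by name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


sequence = first_influence_sequence(
    {"pure cotton", "breathable"}, 1000,
    {"breathable", "quick-dry"}, 400,
)
print(sequence)
```

Here "breathable" appears on both commodities, so it receives (1000 + 400) / 2 = 700 and ranks between the two terms that are unique to one commodity.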

Further, the target commodity includes a third target commodity and a fourth target commodity, and acquiring the sales index data of the target commodity and respectively calculating the influence degrees of the plurality of term words on the sales index data includes: acquiring third sales data of the third target commodity and fourth sales data of the fourth target commodity, wherein the sales index data comprises the sales data; determining a third term set of the third target commodity and a fourth term set of the fourth target commodity; assigning the third sales data to each term in the third term set, and assigning the fourth sales data to each term in the fourth term set; filtering out the terms that appear in both the third term set and the fourth term set to obtain a fifth term set; and ordering each term in the fifth term set based on the assigned sales data to obtain a second influence degree sequence.
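The filtering variant above can be sketched in the same way. This Python example assumes that "filtering" means discarding every term shared by the two commodities, so that only terms unique to one commodity remain in the fifth term set; the function name, sample values, and descending sort order are likewise assumptions.

```python
from typing import List, Set, Tuple


def second_influence_sequence(
    third_terms: Set[str], third_sales: float,
    fourth_terms: Set[str], fourth_sales: float,
) -> List[Tuple[str, float]]:
    """Drop terms common to both commodities, then rank the remaining terms."""
    shared = third_terms & fourth_terms
    # The fifth term set: terms that occur on exactly one of the two commodities.
    scores = {term: float(third_sales) for term in third_terms - shared}
    scores.update({term: float(fourth_sales) for term in fourth_terms - shared})
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


sequence = second_influence_sequence(
    {"pure cotton", "breathable"}, 1000,
    {"breathable", "quick-dry"}, 400,
)
print(sequence)
```

One possible reading of this design is that a term used by both commodities cannot account for the difference between their sales figures, so it is excluded from the comparison.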

According to another aspect of the embodiments of the present application, there is also provided an index influence degree calculation apparatus, including: the acquisition module is used for acquiring a commodity description picture of a target commodity; the extraction module is used for extracting the unordered text list from the commodity description picture, and adjusting the unordered text list into an ordered text list according to the text distance to obtain text semantic information of the commodity description picture; the recognition module is used for acquiring a term dictionary of the target commodity and recognizing the term words in the text semantic information by adopting the term dictionary and a term extraction model; and the calculation module is used for acquiring the sales index data of the target commodity and calculating the influence degree of the plurality of term words on the sales index data respectively.

Further, the acquisition module includes: an acquisition unit, used for acquiring a Uniform Resource Locator (URL) address of the commodity description picture of the target commodity; and a downloading unit, used for calling a downloading tool to download the commodity description picture from the URL address.

Further, the extraction module comprises: an extraction unit, used for extracting a plurality of text characters and corresponding text coordinates from the commodity description picture by adopting an Optical Character Recognition (OCR) tool; a first generating unit, configured to generate an unordered text list using the text characters and corresponding text coordinates, where the unordered text list includes a plurality of text characters whose language order is random, and each element of the unordered text list includes: a text character, the horizontal coordinate corresponding to the text character, and the vertical coordinate corresponding to the text character; and a second generating unit, used for generating text semantic information of the commodity description picture by adopting the unordered text list, wherein the text semantic information comprises a plurality of text characters arranged in order according to scene semantics.

Further, the second generating unit includes: an extraction subunit, used for extracting an initial text from the unordered text list according to the text coordinates, and determining the initial text as a target text; an iteration subunit, configured to iteratively perform the following steps until the unordered text list is empty: calculating the center distance between the target text and each remaining text in the unordered text list; judging whether a first nearby text whose center distance is smaller than a preset threshold exists in a first adjacent direction; if the first nearby text exists, extracting it from the unordered text list, splicing it after the target text, and updating the current target text to the first nearby text; if no first nearby text exists in the first adjacent direction, judging whether a second nearby text whose center distance is smaller than the preset threshold exists in a second adjacent direction; if the second nearby text exists, extracting it from the unordered text list, splicing it after the target text, and updating the current target text to the second nearby text; if no second nearby text exists in the second adjacent direction, extracting the initial text in the first adjacent direction, determining it as a third nearby text of the target text, splicing the third nearby text after the target text, and updating the current target text to the third nearby text; and a determining subunit, used for determining the spliced text sequence as the text semantic information of the commodity description picture.

Further, the recognition module includes: a traversal unit, configured to traverse the text semantic information by using the keywords in the term dictionary to obtain a plurality of first target terms matched with corresponding keywords, where the term dictionary includes the keywords of a plurality of terms; a labeling unit, used for labeling each first target term as label information of the corresponding hit field in the text semantic information to obtain sample data; a training unit, used for training a sequence-to-sequence model by adopting the sample data to obtain the term extraction model; an extraction unit, used for extracting a plurality of second target terms from the text semantic information by adopting the term extraction model; and a fusion unit, used for fusing the plurality of first target terms and the plurality of second target terms to obtain the term words in the text semantic information.

Further, the calculation module includes: a first acquisition unit, configured to acquire first sales data of a first target commodity and second sales data of a second target commodity, wherein the sales index data includes the sales data, and the target commodity includes the first target commodity and the second target commodity; a first determining unit, configured to determine a first term set of the first target commodity and a second term set of the second target commodity; a first assigning unit, configured to assign the first sales data to each first term in the first term set, and assign the second sales data to each second term in the second term set; a first judging unit, configured to judge whether a same third term exists in both the first term set and the second term set; a second assigning unit, configured to assign the mean value of the first sales data and the second sales data to the third term if the same third term exists in both the first term set and the second term set; and a first ordering unit, configured to order the first terms, the second terms, and the third term based on the assigned sales data to obtain a first influence degree sequence.

Further, the target commodity comprises a third target commodity and a fourth target commodity, and the calculation module comprises: a second acquisition unit, configured to acquire third sales data of the third target commodity and fourth sales data of the fourth target commodity, wherein the sales index data includes the sales data; a second determining unit, configured to determine a third term set of the third target commodity and a fourth term set of the fourth target commodity; a third assigning unit, configured to assign the third sales data to each term in the third term set, and assign the fourth sales data to each term in the fourth term set; a filtering unit, configured to filter out the terms that appear in both the third term set and the fourth term set, so as to obtain a fifth term set; and a second ordering unit, used for ordering each term in the fifth term set based on the assigned sales data to obtain a second influence degree sequence.

According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program, wherein the program, when run, executes the above steps.

According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus; wherein: a memory for storing a computer program; a processor for executing the program stored in the memory to execute the steps of the method.

Embodiments of the present application also provide a computer program product containing instructions, which when run on a computer, cause the computer to perform the steps of the above method.

According to the invention, a commodity description picture of a target commodity is acquired; an unordered text list is extracted from the commodity description picture and adjusted into an ordered text list according to a text distance, to obtain the text semantic information of the commodity description picture; a term dictionary of the target commodity is acquired, and the term words in the text semantic information are identified by adopting the term dictionary and a term extraction model; and the sales index data of the target commodity are acquired, and the influence degree of each of the plurality of term words on the sales index data is calculated. The term words in the commodity description picture are thereby identified by combining the dictionary and the model, and their influence on the sales index data of the target commodity is calculated.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a block diagram of a hardware configuration of a computer according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method of index influence calculation according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating text arrangement in an embodiment of the invention;

FIG. 4 is a flowchart of OCR text recognition in an embodiment of the present invention;

FIG. 5 is a flowchart of text annotation by calling doccano according to an embodiment of the present invention;

FIG. 6 is a flow chart of term associated product data indicator analysis according to an embodiment of the present invention;

FIG. 7 is a block diagram of a calculation apparatus for index influence degree according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort shall fall within the protection scope of the present application. It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

The method provided in Embodiment 1 of the present application may be executed in a server, a computer, a mobile phone, or a similar computing device. Taking execution on a computer as an example, FIG. 1 is a block diagram of a hardware structure of a computer according to an embodiment of the present invention. As shown in FIG. 1, the computer may include one or more (only one shown in FIG. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and optionally, a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the configuration shown in FIG. 1 is illustrative only and is not intended to be limiting. For example, a computer may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

The memory 104 may be used to store a computer program, for example, a software program and a module of application software, such as a computer program corresponding to a method for calculating an index influence degree in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to a computer through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer. In one example, the transmission device 106 includes a network adapter (NIC) that can be connected to other network devices via a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.

In this embodiment, a method for calculating an index influence degree is provided. FIG. 2 is a flowchart of the method for calculating an index influence degree according to an embodiment of the present invention. As shown in FIG. 2, the flow includes the following steps:

Step S202, acquiring a commodity description picture of a target commodity;

The commodity description picture in this embodiment may be obtained from an online or offline sales channel of the target commodity, such as a purchase page of an e-commerce platform (e.g., Alibaba, JD.com, etc.), an official website, a buyer review list, a brochure, and the like.

Alternatively, the target product may be a single product, or may be a plurality of products with the same or similar product attributes, such as products of the same manufacturer, products of the same category, products sold in the same area or channel, and the like.

Step S204, extracting an unordered text list from the commodity description picture, and adjusting the unordered text list into an ordered text list according to the text distance to obtain text semantic information of the commodity description picture;

The text semantic information in this embodiment is a long text that restores the scene semantics of the text in the commodity description picture.

Step S206, acquiring a term dictionary of the target commodity, and identifying term words in the text semantic information by adopting the term dictionary and the term extraction model;

Step S208, acquiring the sales index data of the target commodity, and respectively calculating the influence degree of the plurality of term words on the sales index data.

Optionally, after the influence degrees of the plurality of term words on the sales index data are respectively calculated, the corresponding term words are marked by using the influence degrees, or the influence degrees are converted into recommendation degrees or priority levels (the recommendation degrees or the priority levels are positively correlated with the influence degrees), and when a user generates a new product description picture or adjusts a generated product description picture next time, a more appropriate term word can be selected based on the recommendation degrees or the priority levels.
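One simple way to turn influence degrees into recommendation degrees that are positively correlated with them is min-max scaling, sketched below in Python. The scaling rule, the function name, and the sample values are assumptions; this embodiment only requires that a higher influence degree yields a higher recommendation degree or priority level.

```python
from typing import Dict


def recommendation_degrees(influence: Dict[str, float]) -> Dict[str, float]:
    """Scale influence degrees to the range [0, 1], preserving their order."""
    if not influence:
        return {}
    low, high = min(influence.values()), max(influence.values())
    if high == low:
        # All terms are equally influential; give them the same top score.
        return {term: 1.0 for term in influence}
    return {term: (value - low) / (high - low) for term, value in influence.items()}


degrees = recommendation_degrees(
    {"pure cotton": 1000, "breathable": 700, "quick-dry": 400}
)
print(degrees)
```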

Optionally, the sales index data may be sales data, click rate, exposure, repurchase rate, favorable review rate, number of favorites, recommendation rate, etc.; sales data is taken as the example in this embodiment.

Through the above steps, a commodity description picture of the target commodity is acquired; an unordered text list is extracted from the picture and adjusted into an ordered text list according to the text distance, to obtain the text semantic information of the picture; a term dictionary of the target commodity is acquired, and the term words in the text semantic information are identified by adopting the term dictionary and the term extraction model; and the sales index data of the target commodity are acquired, and the influence degree of each term word on the sales index data is calculated. The term words describing the commodity are thereby identified accurately, and their influence on the sales index data is calculated efficiently, so that term words with a high influence degree can be screened and retained while term words with a low influence degree are reduced. This improves the readability of the commodity description picture, reduces junk text in it, improves its text quality, and solves the technical problem in the related art that the influence of a commodity's description terms on the sales index data cannot be calculated accurately.

In one implementation of this embodiment, acquiring the commodity description picture of the target commodity includes: acquiring a Uniform Resource Locator (URL) address of the commodity description picture of the target commodity; and calling a downloading tool to download the commodity description picture from the URL address.
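The download step might look like the following Python sketch, which uses the standard library as the "downloading tool". The function name, the timeout, and the destination path are hypothetical; this embodiment does not specify which tool is used.

```python
import shutil
import urllib.request
from pathlib import Path


def download_picture(url: str, dest: Path, timeout: float = 10.0) -> Path:
    """Fetch the picture at `url` and store it at `dest`, creating folders as needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, "wb") as out:
        # Stream the response body to disk without loading it all into memory.
        shutil.copyfileobj(response, out)
    return dest
```

A call such as `download_picture(picture_url, Path("pictures/item.png"))` would then store the picture locally for the OCR step, where `picture_url` stands for the URL address acquired for the target commodity.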

In one example, extracting the unordered text list from the commodity description picture, and adjusting the unordered text list into the ordered text list according to the text distance to obtain the text semantic information of the commodity description picture includes:

S11, extracting a plurality of text characters and corresponding text coordinates from the commodity description picture by adopting an Optical Character Recognition (OCR) tool;

S12, generating an unordered text list by adopting the plurality of text characters and corresponding text coordinates, wherein the unordered text list comprises a plurality of text characters whose language order is random, and each element of the unordered text list comprises: a text character, the horizontal coordinate corresponding to the text character, and the vertical coordinate corresponding to the text character;

The text character of each element in the unordered text list of this embodiment corresponds to one image frame (bounding box) in the commodity description picture, and the abscissa and ordinate of a text character may be the abscissa and ordinate of the center position of its image frame.

A coordinate system is constructed in the commodity description picture in advance: the upper-left boundary point of the picture is selected as the origin, the positive direction of the X axis points to the right of the origin, and the positive direction of the Y axis points downward from the origin.
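The elements of the unordered text list can be pictured with the following Python sketch, which converts OCR bounding boxes into elements holding a text and the center coordinates of its image frame. The class name, the function name, and the `(text, left, top, right, bottom)` input layout are assumptions; real OCR tools return their own result formats.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class TextElement:
    text: str
    x: float  # abscissa of the frame center (X grows to the right)
    y: float  # ordinate of the frame center (Y grows downward)


def to_unordered_list(
    ocr_boxes: Iterable[Tuple[str, float, float, float, float]]
) -> List[TextElement]:
    """Build list elements from (text, left, top, right, bottom) boxes."""
    return [
        TextElement(text, (left + right) / 2, (top + bottom) / 2)
        for text, left, top, right, bottom in ocr_boxes
    ]


elements = to_unordered_list([("cotton", 40, 0, 80, 20), ("Soft", 0, 0, 20, 20)])
print(elements)
```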

Optionally, after the unordered text list is generated by using the plurality of text characters and corresponding text coordinates, the image layout parameters of the commodity description picture, such as color, font and font size, may be analyzed, and the text sequence in the unordered text list is adjusted according to the image layout parameters, so that text characters with the same image layout parameters are clustered together and a preliminary ordering of the text characters is achieved.
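This optional preliminary ordering could be sketched as a grouping by layout parameters, as in the Python example below. The dictionary keys `color`, `font`, and `size` and the function name are hypothetical, and exact equality of the parameters is assumed as the clustering criterion.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_layout(items: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Reorder items so that texts sharing color, font and size sit together."""
    groups = defaultdict(list)
    for item in items:
        groups[(item["color"], item["font"], item["size"])].append(item)
    # Groups keep the order in which their first member appeared.
    return [item for group in groups.values() for item in group]


items = [
    {"text": "Soft", "color": "red", "font": "Hei", "size": 24},
    {"text": "machine washable", "color": "black", "font": "Song", "size": 12},
    {"text": "cotton", "color": "red", "font": "Hei", "size": 24},
]
print([item["text"] for item in group_by_layout(items)])
```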

And S13, generating text semantic information of the commodity description picture by adopting the unordered text list, wherein the text semantic information comprises a plurality of text characters which are orderly arranged according to scene semantics.

Optionally, generating the text semantic information of the commodity description picture from the unordered text list includes: extracting a starting text from the unordered text list according to the text coordinates, and determining the starting text as the target text; iteratively performing the following steps until the unordered text list is empty: calculating the center distance between the target text and each remaining text in the unordered text list; judging whether a first near text whose center distance is smaller than a preset threshold exists in a first adjacent direction; if so, extracting the first near text from the unordered text list, splicing it after the target text, and updating the current target text to the first near text; if not, judging whether a second near text whose center distance is smaller than the preset threshold exists in a second adjacent direction; if so, extracting the second near text from the unordered text list, splicing it after the target text, and updating the current target text to the second near text; if not, extracting the starting text in the first adjacent direction, determining it as a third near text of the target text, splicing the third near text after the target text, and updating the current target text to the third near text; and determining the spliced text sequence as the text semantic information of the commodity description picture.

Optionally, the first adjacent direction is the right side of the target text, the second adjacent direction is the lower side of the target text, and the unordered text list is traversed from left to right and from top to bottom. The first and second adjacent directions can be adapted to the layout format of the commodity description picture, so as to shorten the traversal distance and time.

In other words, the first text in the unordered text list is taken as the target text, and the center distance between the target text and every text in each adjacent direction is calculated. A text whose center distance is smaller than a set threshold is treated as a near text and spliced after the target text, and the first text and the near text are removed from the list to prevent repetition. The near text then becomes the target text, and the search for a near text among the remaining texts continues. When no near text exists, the target moves to the first text of the new (remaining) list, and the calculation continues until the list is empty.

In one example, each element of the recognized unordered text list has the form (text, center abscissa, center ordinate). For instance, the list contains four elements: [ (A text, 100, 100), (B text, 150, 300), (C text, 200, 150), (D text, 150, 200) ]. Fig. 3 is a schematic diagram of the text arrangement in the embodiment of the present invention. The threshold is 60, and the correct word order of the original text in the commodity description picture is ABCD, but OCR recognizes it as ACBD. The adjustment process using the detection and adjustment algorithm is then as follows:

According to the text coordinates, text A, which has the smallest abscissa and ordinate (closest to the origin), is selected as the starting text, set as the target text, and popped from the list. The center distances from A to C and from A to B are calculated by traversal as 100 and 50 respectively. The distance between A and B meets the threshold, so B is taken as the near text, popped, and spliced after A to form the text AB. B becomes the new target text, and the distances from B to C and from B to D are calculated as 50√5 and 100, neither of which meets the threshold. The starting text on the right side of the target text is therefore extracted from the remaining unordered text list (traversing from left to right and from top to bottom, the starting text is the text in the first column and first line on the right side), so C becomes the new target text and is popped and spliced after AB to form the text ABC. The distance between C and D is calculated as 50, which meets the threshold, so the text ABCD is formed. Since no text remains after D, the unordered text list is empty, and the sequence ABCD is output.
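The reordering procedure can be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's implementation: "right" and "below" are decided by the dominant axis of the offset, the nearest qualifying text is taken when several qualify, and the fallback starting text is the remaining text closest to the origin. The coordinates in the usage example are hypothetical values chosen so that the center distances equal those quoted in the walkthrough (A to B 50, A to C 100, B to C 50√5, B to D 100, C to D 50); they are not the coordinates listed in the example.

```python
import math


def restore_order(items, threshold=60.0):
    """Reorder (text, cx, cy) tuples by walking to nearby texts.

    Prefers a text to the right within the threshold, then one below,
    and otherwise restarts from the remaining text closest to the origin.
    """
    remaining = list(items)
    ordered = []
    if not remaining:
        return ordered

    def pop_start():
        start = min(remaining, key=lambda it: math.hypot(it[1], it[2]))
        remaining.remove(start)
        return start

    def pop_near(target, direction):
        best, best_dist = None, threshold
        for it in remaining:
            dx, dy = it[1] - target[1], it[2] - target[2]
            if direction == "right":
                in_direction = dx > 0 and dx >= abs(dy)
            else:  # "below"
                in_direction = dy > 0 and dy > abs(dx)
            dist = math.hypot(dx, dy)
            if in_direction and dist < best_dist:
                best, best_dist = it, dist
        if best is not None:
            remaining.remove(best)
        return best

    target = pop_start()
    ordered.append(target)
    while remaining:
        target = (pop_near(target, "right")
                  or pop_near(target, "below")
                  or pop_start())
        ordered.append(target)
    return ordered


# Hypothetical coordinates, listed in the scrambled order A, C, B, D.
scrambled = [("A", 100, 100), ("C", 100, 200), ("B", 150, 100), ("D", 150, 200)]
print("".join(text for text, _, _ in restore_order(scrambled)))  # -> ABCD
```

Swapping the two direction tests inside `pop_near` is one way to adapt the sketch to layouts that read top to bottom first.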

Fig. 4 is a flowchart of OCR text recognition in an embodiment of the present invention.

In this embodiment, recognizing term words in the text semantic information using the term dictionary and the term extraction model includes: traversing the text semantic information with the keywords in the term dictionary to obtain a plurality of first target terms matching the corresponding keywords, wherein the term dictionary comprises the keywords of a plurality of terms; labeling each first target term as the label information of the corresponding hit field in the text semantic information to obtain sample data; training a sequence-to-sequence model with the sample data to obtain the term extraction model; extracting a plurality of second target terms from the text semantic information using the term extraction model; and fusing the plurality of first target terms and the plurality of second target terms to obtain the term words in the text semantic information.

In some examples, the target terms may also be extracted using only the term dictionary or only the term extraction model. During fusion, the set of first target terms may be intersected with the set of second target terms, and the term words in the intersection determined as the term words in the text semantic information.
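A minimal sketch of the fusion step, covering the intersection and the two single-source variants mentioned above. The function name `fuse_terms` and the `mode` values are hypothetical:

```python
def fuse_terms(dictionary_terms, model_terms, mode="intersection"):
    """Fuse first target terms (dictionary) with second target terms (model)."""
    first, second = set(dictionary_terms), set(model_terms)
    if mode == "intersection":
        fused = first & second
    elif mode == "dictionary_only":
        fused = first
    elif mode == "model_only":
        fused = second
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(fused)  # sorted only to make the result deterministic


fused = fuse_terms(["whitening", "oil control"], ["whitening", "acne mark removal"])
```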

Optionally, the term categories in the term dictionary include: products (efficacy, ingredients, packaging, method of use, marketing concepts, product name, place name, product characteristics), endorsements (credentials, experts, technology, sales, introduction), promotional measures (gifts, added value, strategy, red envelopes, flash sales, discounts, time-limited offers), pain points, etc. For example, the term words of the efficacy category include whitening, oil control, refreshing skin care and acne mark removal, and the keywords of the term "whitening" include "whiten", "white bright", "clear", and "clear". If a long text of the text semantic information contains any of these keywords, the text semantic information includes the term "whitening".
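The dictionary traversal can be sketched as a keyword scan that records where each hit occurs, which is the information needed to label hit fields as sample data. The dictionary layout, the function name `match_terms`, and the keyword listed for "oil control" are assumptions for illustration; only the "whitening" keywords come from the example above.

```python
TERM_DICTIONARY = {
    "whitening": {"category": "efficacy",
                  "keywords": ["whiten", "white bright", "clear"]},
    # Hypothetical keyword list, added only to show a second term.
    "oil control": {"category": "efficacy", "keywords": ["oil control"]},
}


def match_terms(text, dictionary):
    """Return sorted (start, end, term, category) tuples for every keyword hit."""
    hits = []
    lowered = text.lower()
    for term, entry in dictionary.items():
        for keyword in entry["keywords"]:
            needle = keyword.lower()
            start = lowered.find(needle)
            while start != -1:
                hits.append((start, start + len(needle), term, entry["category"]))
                start = lowered.find(needle, start + 1)
    return sorted(hits)


sample = "Helps whiten skin and gives oil control all day"
hits = match_terms(sample, TERM_DICTIONARY)
```

Each hit carries character offsets, so the same tuples can be handed to a labeling tool as pre-annotations.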

In one example, the doccano tool is called, and the text semantic information after semantic restoration is uploaded to it. Term word categories are constructed in advance according to industry or brand. The doccano platform supports automatic labeling, and the automatic labeling module is designed heuristically: it is built from dictionary rules and a model prediction function, and a button component for adding new words is provided so that unlabeled words can be added to the dictionary, which can greatly improve labeling efficiency. The sample data are labeled and exported, and a sequence-to-sequence model is used to train the term category recognition model directly, yielding the term extraction model. No separate steps of first extracting terms and then classifying them by semantics are needed; that is, term extraction and term classification are combined in one step, so that the information of the two steps is shared, which helps improve the trained model.
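One way to picture the single-step formulation is to turn each labeled sample into a source/target pair in which the target string carries both the term and its category. The span format `(start, end, category)` and the `term|category` target encoding are assumptions made for this sketch; the embodiment does not specify how training pairs are serialized.

```python
def to_seq2seq_pair(text, spans):
    """Build one training pair from a labeled text.

    spans: iterable of (start, end, category) character offsets.
    The target lists every term with its category, so a single model
    learns extraction and classification together.
    """
    parts = [f"{text[start:end]}|{category}"
             for start, end, category in sorted(spans)]
    return {"source": text, "target": "; ".join(parts)}


pair = to_seq2seq_pair(
    "whiten skin, free gift",
    [(13, 22, "promotion"), (0, 6, "efficacy")],
)
```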

Fig. 5 is a flowchart of invoking doccano for text annotation according to the embodiment of the present invention.

In an implementation scenario of this embodiment, the target commodity includes a first target commodity and a second target commodity, and acquiring the sales index data of the target commodity and respectively calculating the influence degrees of the plurality of term words on the sales index data includes: acquiring first sales data of the first target commodity and second sales data of the second target commodity, wherein the sales index data comprises sales data; determining a first term set of the first target commodity and a second term set of the second target commodity; assigning each first term in the first term set using the first sales data, and assigning each second term in the second term set using the second sales data; judging whether a same third term exists in both the first term set and the second term set; if the same third term exists, assigning the third term using the average of the first sales data and the second sales data; and ranking the first terms, the second terms and the third term based on the sales data to obtain a first influence degree sequence.

In one example, commodity data indexes such as monthly sales volume information are obtained, and term importance is calculated. The calculation process is as follows: acquiring the plurality of term words of each commodity, recording the commodity's sales volume against each of its term words, averaging the sales volumes across different commodities to obtain the average sales volume of each term, and sorting in descending order of sales volume to obtain the term word importance.

For example, product A has a sales volume of 1000 and the term words [whitening, oil control, refreshing skin care], and product B has a sales volume of 2000 and the term words [whitening, oil control, acne mark removal]. The average sales volume per term is then [whitening] 1500, [oil control] 1500, [refreshing skin care] 1000 and [acne mark removal] 2000. Sorted in descending order of sales volume, the terms are [acne mark removal, whitening, oil control, refreshing skin care]: acne mark removal has the highest influence on sales volume, and refreshing skin care has the lowest.
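The averaging and ranking can be sketched as below, using the two products from the example (sales volumes 1000 and 2000). The function name `rank_terms_by_mean_sales` and the `(sales, terms)` input layout are hypothetical; ties keep the order in which the terms were first seen.

```python
from collections import defaultdict


def rank_terms_by_mean_sales(products):
    """products: list of (sales, terms). Returns [(term, mean_sales)], highest first."""
    sales_by_term = defaultdict(list)
    for sales, terms in products:
        for term in dict.fromkeys(terms):  # drop duplicates, keep order
            sales_by_term[term].append(sales)
    means = {term: sum(values) / len(values)
             for term, values in sales_by_term.items()}
    # sorted() is stable, so equal means keep first-seen order.
    return sorted(means.items(), key=lambda kv: -kv[1])


products = [
    (1000, ["whitening", "oil control", "refreshing skin care"]),
    (2000, ["whitening", "oil control", "acne mark removal"]),
]
ranking = rank_terms_by_mean_sales(products)
# [('acne mark removal', 2000.0), ('whitening', 1500.0),
#  ('oil control', 1500.0), ('refreshing skin care', 1000.0)]
```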

In another implementation scenario of this embodiment, the target commodity includes a third target commodity and a fourth target commodity, and acquiring the sales index data of the target commodity and respectively calculating the influence degrees of the plurality of term words on the sales index data includes: acquiring third sales data of the third target commodity and fourth sales data of the fourth target commodity, wherein the sales index data comprises sales data; determining a third term set of the third target commodity and a fourth term set of the fourth target commodity; assigning each term in the third term set using the third sales data, and assigning each term in the fourth term set using the fourth sales data; filtering out the terms shared by the third term set and the fourth term set to obtain a fifth term set; and ranking each term in the fifth term set based on the sales data to obtain a second influence degree sequence.

In one example, commodity data indexes such as monthly sales volume information are obtained, and the difference in influence of different terms on the products is compared. The calculation process is as follows: acquiring the plurality of term words of each commodity, recording the commodity's sales volume against each of its term words, removing the term words shared by different commodities, and keeping only the differing term words. This gives a preliminary view of the sales difference between the products, part of which is attributable to the differing term words.

For example, product A has a sales volume of 1000 and the term words [whitening, oil control, refreshing skin care], and product B has a sales volume of 2000 and the term words [whitening, oil control, acne mark removal]. After the shared terms are removed, product A has a sales volume of 1000 with the term [refreshing skin care], and product B has a sales volume of 2000 with the term [acne mark removal]. The following product information is obtained: compared with the keyword "refreshing skin care", the keyword "acne mark removal" can, to a certain extent, influence whether consumers buy the product.
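The pairwise comparison can be sketched as a set difference followed by a sort, again with the two products from the example. The function name `distinguishing_terms` and the `(sales, terms)` input layout are hypothetical.

```python
def distinguishing_terms(product_a, product_b):
    """Drop the terms shared by both products and rank the rest by sales.

    Each product is a (sales, terms) pair; the result is a list of
    (term, sales) tuples in descending order of sales.
    """
    (sales_a, terms_a), (sales_b, terms_b) = product_a, product_b
    shared = set(terms_a) & set(terms_b)
    kept = [(term, sales_a) for term in terms_a if term not in shared]
    kept += [(term, sales_b) for term in terms_b if term not in shared]
    return sorted(kept, key=lambda kv: -kv[1])


difference = distinguishing_terms(
    (1000, ["whitening", "oil control", "refreshing skin care"]),
    (2000, ["whitening", "oil control", "acne mark removal"]),
)
# [('acne mark removal', 2000), ('refreshing skin care', 1000)]
```

As the description notes, the remaining terms explain only part of the sales difference, so the output is a starting point for analysis rather than a causal estimate.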

According to the scheme of this embodiment, OCR technology is used to extract texts from pictures, and the scene semantics of the original text are restored according to its coordinate information. Term category standards such as products, pain points, marketing measures and qualification endorsements are established according to industry characteristics; sample data are labeled rapidly with the labeling tool doccano; a term model is fine-tuned using a sequence-to-sequence method; and the specified term category words of a specific industry are finally extracted from the text. Term extraction and term classification are combined in one step, so that term category information is fully fused into model training and can be directly associated with product index data analysis. Fig. 6 is a flowchart of the term-associated product data index analysis according to an embodiment of the present invention, including:

OCR text recognition and scene semantic restoration, comprising:

step one: acquiring the picture URL addresses of the commodity detail pages of a certain industry or brand, downloading the pictures with the download tool requests, and saving them locally;

step two: recognizing the picture text by using an OCR recognition tool, extracting the text and the text coordinate information to obtain a text list without strict language sequence;

step three: taking the first text in the list from step two as the target text, and calculating the center distance between the target text and all the following texts; treating a text whose center distance is smaller than a set threshold as a near text, splicing it after the target text, and removing the first text and the near text from the list to prevent repetition; taking the near text as the target text and continuing to search the remaining texts for a near text; if no near text exists, moving the target to the first text of the new list; and continuing the calculation until the list is empty;

step four: outputting and storing the picture text information after the semantics are restored;
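Step one of this flow can be sketched as follows, assuming the third-party `requests` library is installed. The function names, the 10-second timeout and the fallback file name are hypothetical; the `fetch` parameter exists only so that the download can be replaced by a stub when no network is available.

```python
from pathlib import Path
from urllib.parse import urlparse


def _fetch_with_requests(url):
    import requests  # third-party; imported lazily so a stub can be used offline

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.content


def download_picture(url, out_dir, fetch=_fetch_with_requests):
    """Download one detail-page picture and return the local path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    name = Path(urlparse(url).path).name or "picture.jpg"
    path = out_dir / name
    path.write_bytes(fetch(url))
    return path
```

Calling `download_picture(url, "pictures")` for each URL collected from the detail pages would populate the local folder that the OCR step reads from.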

doccano platform term labeling with heuristic automatic labeling, comprising:

step one: uploading the semantically restored text data to the doccano platform;

step two: constructing the term word categories according to industry or brand: products (efficacy, ingredients, packaging, method of use, marketing concepts, product name, place name, product characteristics), endorsements (credentials, experts, technology, sales, introduction), promotional measures (gifts, added value, strategy, red envelopes, flash sales, discounts), pain points, etc.;

step three: the doccano platform supports automatic labeling, and the invention designs the automatic labeling module heuristically: the module is built from dictionary rules and a model prediction function, and an [add new word] button is provided so that unlabeled words can be added to the dictionary, which can greatly improve labeling efficiency;

step four: labeling and exporting the sample data, and training the term category recognition model directly with a sequence-to-sequence model to obtain the term extraction model, without the two separate steps of first extracting terms and then classifying them by semantics; that is, term extraction and term classification are combined in one step so that the information of the two steps is shared, which helps improve the trained model.

The module for analyzing product data indexes (such as sales volume) associated with commodity term words comprises:

step one: extracting the term words in the detail page pictures of the commodity to be analyzed with the trained term extraction model, to obtain a plurality of term words corresponding to the commodity;

step two: acquiring commodity data indexes such as monthly sales volume information, and calculating term importance. The specific calculation process is as follows: acquiring the plurality of term words of each commodity, recording the commodity's sales volume against each of its term words, averaging the sales volumes across different commodities to obtain the average sales volume of each term, and sorting in descending order of sales volume to obtain the term word importance;

step three: acquiring commodity data indexes such as monthly sales volume information, and comparing product differences. The specific calculation process is as follows: acquiring the plurality of term words of each commodity, recording the commodity's sales volume against each of its term words, removing the term words shared by different commodities, and keeping only the differing term words, to obtain a preliminary view of the sales difference between the products, part of which is attributable to the differing term words;

step four: exporting the term importance ranking and the pairwise commodity comparison results, to facilitate further operational analysis and improvement of the marketing scheme.

With the scheme of this embodiment, center distance information is used in picture OCR to restore the original semantics of the text in pictures, solving problems such as the separation of image and text information; in doccano term labeling, a heuristic automatic labeling function based on dictionary rules and model prediction greatly saves labeling time and improves labeling efficiency; and by analyzing the relationship between different commodity data indexes and term words, the strengths and weaknesses of product marketing copy can be analyzed in finer detail. The scheme handles the collection of terms for an industry or field, can actively perform correlation analysis on product data indexes, calculates the influence degree of the terms in the commodity description picture on the sales index data, reduces the proportion of low-quality terms, improves the quality and readability of the commodity description picture, and gives merchants reference suggestions for managing marketing terms.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Example 2

In this embodiment, an apparatus for calculating index influence degree is further provided. The apparatus is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 7 is a block diagram of a structure of an index influence degree calculation apparatus according to an embodiment of the present invention, as shown in fig. 7, the apparatus including: an acquisition module 70, an extraction module 72, a recognition module 74, a calculation module 76, wherein,

an acquiring module 70, configured to acquire a commodity description picture of a target commodity;

the extracting module 72 is configured to extract an unordered text list from the commodity description picture, and adjust the unordered text list into an ordered text list according to a text distance to obtain text semantic information of the commodity description picture;

the recognition module 74 is configured to obtain a term dictionary of the target product, and recognize a term word in the text semantic information by using the term dictionary and a term extraction model;

and the calculating module 76 is configured to obtain the sales index data of the target product, and calculate influence degrees of the plurality of term words on the sales index data respectively.
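The division into four modules can be pictured as a small pipeline object. This is only an illustrative sketch: the class name `InfluencePipeline` and the callable signatures are assumptions, and each callable stands in for one of modules 70, 72, 74 and 76.

```python
class InfluencePipeline:
    """Chains four callables that stand in for modules 70, 72, 74 and 76."""

    def __init__(self, acquire, extract, recognize, calculate):
        self.acquire = acquire      # commodity -> description picture
        self.extract = extract      # picture -> text semantic information
        self.recognize = recognize  # semantic text -> term words
        self.calculate = calculate  # (commodity, term words) -> influence degrees

    def run(self, commodity):
        picture = self.acquire(commodity)
        semantic_text = self.extract(picture)
        terms = self.recognize(semantic_text)
        return self.calculate(commodity, terms)


# Stub callables, used only to show how the modules are wired together.
pipeline = InfluencePipeline(
    acquire=lambda commodity: f"{commodity}.jpg",
    extract=lambda picture: "whiten skin",
    recognize=lambda text: ["whitening"],
    calculate=lambda commodity, terms: {term: 1000 for term in terms},
)
result = pipeline.run("product-a")
```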

Optionally, the acquisition module includes: an acquisition unit, configured to acquire a Uniform Resource Locator (URL) address of the commodity description picture of the target commodity; and a downloading unit, configured to call a download tool to download the commodity description picture from the URL address.

Optionally, the extraction module includes: an extraction unit, configured to extract a plurality of text characters and their corresponding text coordinates from the commodity description picture by using an optical character recognition (OCR) tool; a first generating unit, configured to generate an unordered text list from the text characters and the corresponding text coordinates, where the unordered text list includes a plurality of text characters that are not arranged in language order, and each element of the unordered text list includes: a text character, the abscissa corresponding to the text character, and the ordinate corresponding to the text character; and a second generating unit, configured to generate text semantic information of the commodity description picture from the unordered text list, where the text semantic information includes a plurality of text characters arranged in order according to scene semantics.

Optionally, the second generating unit includes: an extraction subunit, configured to extract a starting text from the unordered text list according to the text coordinates, and determine the starting text as the target text; an iteration subunit, configured to iteratively perform the following steps until the unordered text list is empty: calculating the center distance between the target text and each remaining text in the unordered text list; judging whether a first near text whose center distance is smaller than a preset threshold exists in a first adjacent direction; if so, extracting the first near text from the unordered text list, splicing it after the target text, and updating the current target text to the first near text; if not, judging whether a second near text whose center distance is smaller than the preset threshold exists in a second adjacent direction; if so, extracting the second near text from the unordered text list, splicing it after the target text, and updating the current target text to the second near text; if not, extracting the starting text in the first adjacent direction, determining it as a third near text of the target text, splicing the third near text after the target text, and updating the current target text to the third near text; and a determining subunit, configured to determine the spliced text sequence as the text semantic information of the commodity description picture.

Optionally, the recognition module includes: a traversal unit, configured to traverse the text semantic information with the keywords in the term dictionary to obtain a plurality of first target terms matching the corresponding keywords, where the term dictionary includes the keywords of a plurality of terms; a labeling unit, configured to label each first target term as the label information of the corresponding hit field in the text semantic information to obtain sample data; a training unit, configured to train a sequence-to-sequence model with the sample data to obtain a term extraction model; an extraction unit, configured to extract a plurality of second target terms from the text semantic information using the term extraction model; and a fusion unit, configured to fuse the plurality of first target terms and the plurality of second target terms to obtain the term words in the text semantic information.

Optionally, the calculation module includes: a first acquisition unit, configured to acquire first sales data of a first target commodity and second sales data of a second target commodity, where the sales index data includes the sales data, and the target commodity includes the first target commodity and the second target commodity; a first determining unit, configured to determine a first term set of the first target commodity and a second term set of the second target commodity; a first assigning unit, configured to assign each first term in the first term set using the first sales data, and assign each second term in the second term set using the second sales data; a first judging unit, configured to judge whether a same third term exists in both the first term set and the second term set; a second assigning unit, configured to assign the third term using the average of the first sales data and the second sales data if the same third term exists in the first term set and the second term set; and a first ordering unit, configured to rank the first terms, the second terms and the third term based on the sales data to obtain a first influence degree sequence.

Optionally, the target commodity includes a third target commodity and a fourth target commodity, and the calculation module includes: a second acquisition unit, configured to acquire third sales data of the third target commodity and fourth sales data of the fourth target commodity, where the sales index data includes the sales data; a second determining unit, configured to determine a third term set of the third target commodity and a fourth term set of the fourth target commodity; a third assigning unit, configured to assign each term in the third term set using the third sales data, and assign each term in the fourth term set using the fourth sales data; a filtering unit, configured to filter out the terms shared by the third term set and the fourth term set to obtain a fifth term set; and a second ordering unit, configured to rank each term in the fifth term set based on the sales data to obtain a second influence degree sequence.

It should be noted that the above modules may be implemented by software or hardware; for the latter, this may be achieved in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.

Example 3

Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:

S1, acquiring a commodity description picture of a target commodity;

S2, extracting an unordered text list from the commodity description picture, and adjusting the unordered text list into an ordered text list according to a text distance to obtain text semantic information of the commodity description picture;

S3, acquiring a term dictionary of the target commodity, and identifying term words in the text semantic information by adopting the term dictionary and a term extraction model;

S4, acquiring sales index data of the target commodity, and respectively calculating the influence degrees of the plurality of term words on the sales index data.

Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:

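S1, acquiring a commodity description picture of a target commodity;

S2, extracting an unordered text list from the commodity description picture, and adjusting the unordered text list into an ordered text list according to a text distance to obtain text semantic information of the commodity description picture;

S3, acquiring a term dictionary of the target commodity, and identifying term words in the text semantic information by adopting the term dictionary and a term extraction model;

S4, acquiring the sales index data of the target commodity, and respectively calculating the influence degrees of the plurality of term words on the sales index data.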

Optionally, for a specific example in this embodiment, reference may be made to the examples described in the above embodiment and optional implementation, and this embodiment is not described herein again.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technical content can be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part thereof that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.

The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered to fall within the protection scope of the present application.

Claims

1. A method of calculating an index influence degree, comprising:

acquiring a commodity description picture of a target commodity;

extracting an unordered text list from the commodity description picture, and adjusting the unordered text list into an ordered text list according to a text distance to obtain text semantic information of the commodity description picture;

acquiring a term dictionary of the target commodity, and identifying term words in the text semantic information by adopting the term dictionary and a term extraction model;

and acquiring the sales index data of the target commodity, and respectively calculating the influence degree of the plurality of term words on the sales index data.

2. The method of claim 1, wherein acquiring a commodity description picture of a target commodity comprises:

acquiring a Uniform Resource Locator (URL) address of a commodity description picture of the target commodity;

and calling a downloading tool to download the commodity description picture from the URL address.
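As an illustration only, the download step of claim 2 might be sketched in Python as follows. The claim does not name a downloading tool; the sketch assumes the standard-library `urllib.request.urlopen`. The function name `download_picture` and the `opener` parameter are hypothetical and are introduced here only so that the step can be exercised without network access.

```python
import shutil
import urllib.request
from pathlib import Path


def download_picture(url, dest_path, opener=urllib.request.urlopen):
    """Download the picture at `url` and save it to `dest_path`.

    `opener` (hypothetical parameter) defaults to the standard-library
    urlopen and can be replaced by any callable that returns a readable,
    context-managed byte stream.
    """
    dest = Path(dest_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response body to disk instead of holding it all in memory.
    with opener(url) as response, open(dest, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return dest
```

In practice the URL address would come from the commodity record of the target commodity, and the returned path would be handed to the text extraction step.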

3. The method of claim 1, wherein extracting an unordered text list from the commodity description picture, and adjusting the unordered text list into an ordered text list according to a text distance, and obtaining text semantic information of the commodity description picture comprises:

extracting a plurality of text characters and corresponding text coordinates in the commodity description picture by adopting an Optical Character Recognition (OCR) tool;

generating an unordered text list using the plurality of text characters and the corresponding text coordinates, wherein the unordered text list includes a plurality of text characters that are not arranged in language order, and each element of the unordered text list includes: text characters, horizontal coordinates corresponding to the text characters, and vertical coordinates corresponding to the text characters;

and generating text semantic information of the commodity description picture by adopting the unordered text list, wherein the text semantic information comprises a plurality of text characters which are orderly arranged according to scene semantics.
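A minimal sketch of the unordered text list of claim 3 is given below. The claim does not specify the OCR tool or its output format; the sketch assumes that the tool yields pairs of recognized text and a bounding box `(x_min, y_min, x_max, y_max)`, and takes the box center as the text coordinates. The names `TextItem` and `build_unordered_list` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextItem:
    text: str  # recognized text characters
    x: float   # horizontal coordinate (assumed: bounding-box center)
    y: float   # vertical coordinate (assumed: bounding-box center)


def build_unordered_list(ocr_results):
    """Convert assumed OCR output into the unordered text list.

    `ocr_results` is an iterable of (text, (x_min, y_min, x_max, y_max)).
    The order of the input is preserved; no sorting is applied here.
    """
    items = []
    for text, (x_min, y_min, x_max, y_max) in ocr_results:
        items.append(TextItem(text, (x_min + x_max) / 2, (y_min + y_max) / 2))
    return items
```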

4. The method of claim 3, wherein generating the text semantic information of the commodity description picture using the unordered text list comprises:

extracting an initial text in the unordered text list according to text coordinates, and determining the initial text as a target text;

iteratively performing the following steps until the unordered text list is empty: calculating the center distance between the target text and each remaining text in the unordered text list; judging whether a first near text whose center distance is smaller than a preset threshold exists in a first adjacent direction; if the first near text whose center distance is smaller than the preset threshold exists in the first adjacent direction, extracting the first near text from the unordered text list, splicing the first near text after the target text, and updating the current target text to the first near text; if no first near text whose center distance is smaller than the preset threshold exists in the first adjacent direction, judging whether a second near text whose center distance is smaller than the preset threshold exists in a second adjacent direction; if the second near text whose center distance is smaller than the preset threshold exists in the second adjacent direction, extracting the second near text from the unordered text list, splicing the second near text after the target text, and updating the current target text to the second near text; if no second near text whose center distance is smaller than the preset threshold exists in the second adjacent direction, extracting the initial text in the first adjacent direction, determining the initial text in the first adjacent direction as a third near text of the target text, splicing the third near text after the target text, and updating the current target text to the third near text;

and determining the spliced text sequence as the text semantic information of the commodity description picture.
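The iterative ordering of claim 4 can be sketched as follows, under several assumptions that the claim leaves open: the first adjacent direction is taken to be rightward and the second downward; image coordinates are assumed, so the vertical coordinate grows downward; the "initial text" is interpreted as the top-most, then left-most remaining element; and the center distance is taken as the Euclidean distance. The names `TextItem`, `order_texts` and the helper functions are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TextItem:
    text: str
    x: float  # horizontal center coordinate
    y: float  # vertical center coordinate (assumed to grow downward)


def _initial_text(items):
    # Assumed meaning of "initial text": top-most, then left-most element.
    return min(items, key=lambda item: (item.y, item.x))


def _nearest(target, candidates, direction, threshold):
    """Closest candidate in `direction` with distance < threshold, else None."""
    best, best_dist = None, threshold
    for cand in candidates:
        dx, dy = cand.x - target.x, cand.y - target.y
        if direction == "right":
            in_direction = dx > 0 and abs(dy) <= abs(dx)
        else:  # "down"
            in_direction = dy > 0 and abs(dx) <= abs(dy)
        dist = math.hypot(dx, dy)
        if in_direction and dist < best_dist:
            best, best_dist = cand, dist
    return best


def order_texts(unordered, threshold):
    """Return the elements of `unordered` in the spliced (ordered) sequence."""
    remaining = list(unordered)
    ordered = []
    if not remaining:
        return ordered
    target = _initial_text(remaining)
    remaining.remove(target)
    ordered.append(target)
    while remaining:
        # Try the first adjacent direction, then the second.
        nxt = _nearest(target, remaining, "right", threshold)
        if nxt is None:
            nxt = _nearest(target, remaining, "down", threshold)
        if nxt is None:
            # Fallback: restart from the initial text of what is left.
            nxt = _initial_text(remaining)
        remaining.remove(nxt)
        ordered.append(nxt)
        target = nxt
    return ordered
```

Joining the `text` fields of the returned sequence, for example with `" ".join(item.text for item in order_texts(items, 15))`, would yield the text semantic information. The threshold would need to be tuned to the font size and layout of the pictures.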

5. The method of claim 1, wherein identifying term words in the text semantic information using the term dictionary and term extraction model comprises:

traversing the text semantic information by using keywords in the term dictionary to obtain a plurality of first target terms matched with corresponding keywords, wherein the term dictionary comprises the keywords of a plurality of terms;

labeling the hit fields in the text semantic information with the corresponding first target terms as label information to obtain sample data;

training a sequence-to-sequence model by adopting the sample data to obtain the term extraction model;

extracting a plurality of second target terms in the text semantic information by adopting the term extraction model;

and fusing the plurality of first target terms and the plurality of second target terms to obtain the term words in the text semantic information.
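The dictionary matching, sample labeling, and fusion steps of claim 5 might be sketched as below. This is not the claimed model training: the sequence-to-sequence model is omitted, and its output (the second target terms) is simply passed in as a list. The dictionary layout (term mapped to a list of keywords) and the names `match_dictionary_terms`, `build_sample` and `fuse_terms` are assumptions made for illustration.

```python
def match_dictionary_terms(text, term_dictionary):
    """Find every keyword hit in `text`.

    `term_dictionary` maps a term to its keywords (assumed layout).
    Returns (start, end, term) tuples sorted by position, where
    text[start:end] is the hit field.
    """
    hits = []
    for term, keywords in term_dictionary.items():
        for keyword in keywords:
            start = text.find(keyword)
            while start != -1:
                hits.append((start, start + len(keyword), term))
                start = text.find(keyword, start + 1)
    return sorted(hits)


def build_sample(text, hits):
    """Pair the text with its labeled hit fields as one training sample."""
    labels = [{"field": text[s:e], "term": term} for s, e, term in hits]
    return {"source": text, "labels": labels}


def fuse_terms(first_terms, second_terms):
    """Merge both term lists, dropping duplicates and keeping first-seen order."""
    return list(dict.fromkeys([*first_terms, *second_terms]))
```

For example, the first target terms come from `match_dictionary_terms`, the second target terms would come from the trained term extraction model, and `fuse_terms` combines the two into the final term words.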

6. The method of claim 1, wherein acquiring the sales index data of the target commodity, and respectively calculating the influence degree of the plurality of term words on the sales index data comprises:

acquiring first sales data of a first target commodity and second sales data of a second target commodity, wherein the sales index data comprises the sales data, and the target commodity comprises the first target commodity and the second target commodity;

determining a first term set of the first target commodity and determining a second term set of the second target commodity;

assigning a value to each first term in the first term set using the first sales data, and assigning a value to each second term in the second term set using the second sales data;

determining whether a same third term exists in the first term set and the second term set;

if a same third term exists in the first term set and the second term set, assigning a value to the third term using an average of the first sales data and the second sales data;

and ranking the first terms, the second terms, and the third term based on the sales data to obtain a first influence degree sequence.
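A minimal sketch of the calculation in claim 6 follows. It assumes that the sales data of each commodity is a single number and that the ranking is in descending order of the assigned value, neither of which is fixed by the claim. The function name `rank_terms_with_shared_average` is hypothetical.

```python
def rank_terms_with_shared_average(first_terms, first_sales, second_terms, second_sales):
    """Assign sales values to terms and rank them.

    Terms unique to one commodity receive that commodity's sales value;
    terms present in both sets receive the average of the two values.
    Returns (term, value) pairs, highest value first (assumed order),
    with ties broken alphabetically for determinism.
    """
    first_terms, second_terms = set(first_terms), set(second_terms)
    scores = {}
    for term in first_terms:
        scores[term] = float(first_sales)
    for term in second_terms:
        scores[term] = float(second_sales)
    # Shared ("third") terms are overwritten with the average.
    for term in first_terms & second_terms:
        scores[term] = (first_sales + second_sales) / 2
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

With first sales data of 100 for terms {cotton, breathable} and second sales data of 40 for terms {cotton, slim fit}, the shared term "cotton" is assigned 70, placing it between "breathable" (100) and "slim fit" (40).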

7. The method of claim 1, wherein the target commodity comprises a third target commodity and a fourth target commodity, and acquiring the sales index data of the target commodity, and respectively calculating the influence degree of the plurality of term words on the sales index data comprises:

acquiring third sales data of a third target commodity and fourth sales data of a fourth target commodity, wherein the sales index data comprises the sales data;

determining a third term set of the third target commodity and determining a fourth term set of the fourth target commodity;

assigning a value to each term in the third term set using the third sales data, and assigning a value to each term in the fourth term set using the fourth sales data;

filtering out the terms that are the same in the third term set and the fourth term set to obtain a fifth term set;

and ranking each term in the fifth term set based on the sales data to obtain a second influence degree sequence.
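The variant of claim 7 differs in that shared terms are discarded rather than averaged. A minimal sketch, under the same assumptions of a single numeric sales value per commodity and a descending ranking, is given below; the function name `rank_exclusive_terms` is hypothetical.

```python
def rank_exclusive_terms(third_terms, third_sales, fourth_terms, fourth_sales):
    """Drop terms common to both commodities and rank the remainder.

    The remaining terms (the "fifth term set") keep the sales value of the
    commodity they belong to. Returns (term, value) pairs, highest value
    first (assumed order), with ties broken alphabetically.
    """
    third_terms, fourth_terms = set(third_terms), set(fourth_terms)
    shared = third_terms & fourth_terms
    scores = {term: float(third_sales) for term in third_terms - shared}
    scores.update({term: float(fourth_sales) for term in fourth_terms - shared})
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Filtering the shared terms isolates the terms that distinguish the two commodities, so that a difference in sales data is attributed only to terms that appear for one commodity and not the other.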

8. An index influence degree calculation device, comprising:

the acquisition module is used for acquiring a commodity description picture of a target commodity;

the extraction module is used for extracting the unordered text list from the commodity description picture, and adjusting the unordered text list into an ordered text list according to the text distance to obtain text semantic information of the commodity description picture;

the recognition module is used for acquiring a term dictionary of the target commodity and recognizing the term words in the text semantic information by adopting the term dictionary and a term extraction model;

and the calculation module is used for acquiring the sales index data of the target commodity and calculating the influence degree of the plurality of term words on the sales index data respectively.

9. A storage medium, comprising a stored computer program, wherein the computer program, when run, performs the steps of the method of any one of claims 1 to 7.

10. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; wherein:

the memory is configured to store a computer program;

the processor is configured to execute the steps of the method of any one of claims 1 to 7 by running the computer program stored in the memory.