CN114898349A

CN114898349A - Target commodity identification method and device, equipment, medium and product thereof

Info

Publication number: CN114898349A
Application number: CN202210580467.2A
Authority: CN
Inventors: 李保俊
Original assignee: Guangzhou Huanju Shidai Information Technology Co Ltd
Current assignee: Guangzhou Huanju Shidai Information Technology Co Ltd
Priority date: 2022-05-25
Filing date: 2022-05-25
Publication date: 2022-08-12

Abstract

The application discloses a target commodity identification method and a device, equipment, a medium and a product thereof, wherein the method comprises the following steps: acquiring a commodity title and a commodity picture in commodity information of a target commodity; extracting deep semantic information of the commodity picture and the commodity title; fusing the deep semantic information of the commodity title to the deep semantic information of the commodity picture so as to highlight the image characteristics of the target commodity in the deep semantic information of the commodity picture according to the deep semantic information of the commodity title and obtain image-text fusion characteristic information; and inputting the image-text fusion characteristic information into a target detection model trained to be convergent in advance, and identifying the target commodity. The method and the device can accurately identify the target commodity in the commodity picture.

Description

Target commodity identification method and device, equipment, medium and product thereof

Technical Field

The present application relates to the field of e-commerce information technology, and in particular, to a target product identification method, and a corresponding apparatus, computer device, computer-readable storage medium, and computer program product.

Background

To show listed goods to better effect, seller users on an e-commerce platform usually display a listed commodity together with other matching articles, so the commodity picture contains not only the listed commodity but also those matching articles. For example, when the listed commodity is a garment, the matching articles may be trousers, hats, shoes and the like; when the listed commodity is a commodity shelf, the matching articles may be household appliances, books, ornaments and the like. Therefore, the listed commodity cannot be determined from the commodity picture alone.

In the e-commerce platform, an image corresponding to a target commodity is often recognized from a commodity picture containing the target commodity, so as to implement other downstream tasks, such as implementing similar matching of commodities, displaying commodity images, and the like. If the image of the corresponding target commodity cannot be quickly acquired from the commodity picture, the efficiency of realizing related downstream tasks is affected, and related functions may not be realized, so that the user experience is reduced.

In a traditional solution, a multi-target identification method is usually adopted to identify each article in a commodity picture, including the listed commodity and its matching articles, and classification prediction is then performed on each article to determine the image belonging to the listed commodity. This approach requires two stages implemented by different models, each of which must perform operations such as image preprocessing, so the process is complicated and relatively inefficient. More troublesome still, the models of the two stages generally need to be trained separately on corresponding data sets, so the training cost is high.

In view of the above, the applicant has attempted to explore other ways in which a target item can be quickly identified from a picture of the item that includes the target item.

Disclosure of Invention

A primary object of the present application is to solve at least one of the above problems and provide a target product identification method, and a corresponding apparatus, computer device, computer readable storage medium, and computer program product.

In order to meet various purposes of the application, the following technical scheme is adopted in the application:

A target commodity identification method adapted to one of the objects of the present application comprises the following steps:

acquiring a commodity title and a commodity picture in commodity information of a target commodity;

extracting deep semantic information of the commodity picture and the commodity title;

fusing the deep semantic information of the commodity title to the deep semantic information of the commodity picture so as to highlight the image characteristics of the target commodity in the deep semantic information of the commodity picture according to the deep semantic information of the commodity title and obtain image-text fusion characteristic information;

and inputting the image-text fusion characteristic information into a target detection model trained to be convergent in advance, and identifying the target commodity.

In a further embodiment, the step of extracting deep semantic information of the commodity picture and the commodity title includes the following steps:

preprocessing the commodity picture, inputting the preprocessed commodity picture into an image feature extraction model which is trained to be convergent in advance, and obtaining corresponding deep semantic information for representing the image feature of the commodity picture;

and preprocessing the commodity title, inputting the preprocessed commodity title into a text feature extraction model which is trained to be convergent in advance, and obtaining corresponding deep semantic information for representing the text feature of the commodity title.

In a further embodiment, the step of preprocessing the title of the product includes the following steps:

filtering invalid characters in the commodity title;

and segmenting the filtered commodity title to obtain key words in the filtered commodity title, wherein the key words comprise product words and/or brand words of the target commodity, and preprocessing the commodity title is completed.
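The two title preprocessing steps above can be sketched as follows. This is a minimal illustration only: the function names, the rule that every non-alphanumeric character counts as "invalid", whitespace-based segmentation, and the small product and brand lexicons are all assumptions made for the example, not details given in the application.

```python
import re

# Hypothetical lexicons for illustration; a real system would likely rely on
# a trained word segmenter or a curated product/brand dictionary.
PRODUCT_WORDS = {"skirt", "dress", "shelf", "coat"}
BRAND_WORDS = {"acme"}


def filter_invalid_chars(title: str) -> str:
    """Replace symbols and punctuation with spaces, then collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]|_", " ", title)
    return re.sub(r"\s+", " ", cleaned).strip()


def extract_keywords(title: str) -> list[str]:
    """Segment the filtered title and keep only product words and brand words."""
    tokens = filter_invalid_chars(title).lower().split()
    return [t for t in tokens if t in PRODUCT_WORDS or t in BRAND_WORDS]


keywords = extract_keywords("★ACME★ Summer Pleated Skirt!!! 【Free Shipping】")
```

Whitespace splitting is only adequate for space-delimited languages; titles in Chinese would need a dedicated word segmenter in place of `split()`.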

In a further embodiment, the step of fusing the deep semantic information of the commodity title to the deep semantic information of the commodity picture to highlight the image feature of the target commodity in the deep semantic information of the commodity picture according to the deep semantic information of the commodity title, and obtaining the image-text fused feature information includes the following steps:

fusing deep semantic information of the commodity title and deep semantic information of the commodity picture by adopting a multi-modal feature interactive fusion module to obtain preliminary fusion feature information, wherein the preliminary fusion feature information obviously represents the features of the image of the target commodity;

and combining the preliminary fusion feature information with the deep semantic information of the commodity picture to obtain the image-text fusion feature information.

In a preferred embodiment, the step of fusing the deep semantic information of the commodity title and the deep semantic information of the commodity picture by using the multi-modal feature interactive fusion module to obtain the preliminary fusion feature information comprises the following steps:

constructing a query vector by using the deep semantic information of the commodity picture, constructing a key vector and a value vector by using the deep semantic information of the commodity title, and inputting the query vector, the key vector and the value vector into an attention layer;

interacting and normalizing the query vector and the key vector by the attention layer to obtain a weight matrix;

matching, by the attention layer, the value vector to the weight matrix to obtain preliminary fused feature information.
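The three attention steps above can be sketched in plain Python as follows. The application does not state the exact interaction or normalization, so this sketch assumes the common scaled dot-product form with a row-wise softmax, and it omits the learned projection matrices that a trained attention layer would normally apply; the function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def softmax_rows(m):
    """Normalize each row so that its entries are positive and sum to 1."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(image_feats, title_feats):
    """image_feats: N x d, used as queries; title_feats: M x d, used as keys and values."""
    d = len(image_feats[0])
    q, k, v = image_feats, title_feats, title_feats  # learned projections omitted (assumption)
    scores = matmul(q, transpose(k))                 # interaction of queries and keys
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scaled)                   # N x M weight matrix
    fused = matmul(weights, v)                       # apply the weights to the values
    return fused, weights
```

Each row of the returned weight matrix says how strongly one image position attends to each title token, which is how title semantics can emphasize the image regions that correspond to the target commodity.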

In a further embodiment, the step of inputting the image-text fusion characteristic information into a target detection model trained to converge in advance and identifying the target commodity includes the following steps:

detecting the target commodity in the commodity picture according to the image-text fusion characteristic information by adopting a target detection model which is trained to be convergent in advance to obtain a corresponding detection area;

and obtaining a rectangular frame with the minimum area surrounding the detection area, and selecting a target commodity as a recognition result by using the rectangular frame.
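As an illustration of the frame selection step, the sketch below computes the smallest axis-aligned rectangle enclosing a detection area. It assumes, for the example only, that the detection area is available as a binary mask (a list of pixel rows) and that the rectangle is axis-aligned; the application specifies neither.

```python
def min_bounding_rect(mask):
    """Return (x_min, y_min, x_max, y_max), inclusive, of the smallest
    axis-aligned rectangle enclosing all non-zero pixels, or None if the
    mask contains no detected pixels."""
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
box = min_bounding_rect(mask)
```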

In an extended embodiment, after the step of inputting the image-text fusion characteristic information into a target detection model trained to be convergent in advance and identifying the target commodity, the method further comprises the following steps:

intercepting an image of the target commodity from the commodity picture according to the rectangular frame of the target commodity selected by the frame, and storing the unique identification code of the target commodity associated with the image in a commodity database;

responding to the commodity recommendation request, searching a commodity database according to the unique identification code of the target commodity to obtain an image of the target commodity, and matching with a recommended commodity similar to the image;

and responding to the commodity recommendation request by pushing the recommended commodity.
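The store-then-recommend flow above might look like the following sketch. Everything concrete here is an assumption for illustration: the in-memory dictionary standing in for the commodity database, the representation of each cropped image as a feature vector, and cosine similarity as the matching criterion, none of which the application specifies.

```python
import math

# Hypothetical in-memory stand-in for the commodity database:
# unique identification code -> feature vector of the cropped target image.
commodity_db = {}


def store_crop(commodity_id, image_vector):
    """Store the cropped image representation under the commodity's unique ID."""
    commodity_db[commodity_id] = image_vector


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(commodity_id, top_k=1):
    """Look up the target image by ID and return the IDs of the most similar others."""
    query = commodity_db[commodity_id]
    scored = [
        (cosine(query, vec), cid)
        for cid, vec in commodity_db.items()
        if cid != commodity_id
    ]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:top_k]]
```

Because the stored image covers only the target commodity, matching articles in the original picture do not influence the similarity score.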

A target product identification device adapted to one of the objects of the present application includes: the system comprises an image-text acquisition module, a semantic extraction module, a feature fusion module and a target identification module, wherein the image-text acquisition module is used for acquiring a commodity title and a commodity picture in commodity information of a target commodity; the semantic extraction module is used for extracting deep semantic information of the commodity picture and the commodity title; the characteristic fusion module is used for fusing the deep semantic information of the commodity title to the deep semantic information of the commodity picture so as to highlight the image characteristics of the target commodity in the deep semantic information of the commodity picture according to the deep semantic information of the commodity title and obtain image-text fusion characteristic information; and the target identification module is used for inputting the image-text fusion characteristic information into a target detection model which is trained to be convergent in advance, and identifying the target commodity.

In a further embodiment, the semantic extraction module includes: the image feature extraction submodule is used for preprocessing the commodity picture, inputting the preprocessed commodity picture into an image feature extraction model which is trained to be convergent in advance, obtaining corresponding deep semantic information and representing the image features of the commodity picture; and the text feature extraction submodule is used for preprocessing the commodity title, inputting the preprocessed commodity title into a text feature extraction model which is trained to be convergent in advance, and obtaining corresponding deep semantic information used for representing the text feature of the commodity title.

In a further embodiment, the text feature extraction sub-module includes: the character filtering unit is used for filtering invalid characters in the commodity title; and the text word segmentation unit is used for segmenting the filtered commodity title to obtain key words in the filtered commodity title, wherein the key words comprise product words and/or brand words of the target commodity, and the preprocessing of the commodity title is completed.

In a further embodiment, the feature fusion module includes: the semantic fusion sub-module is used for fusing deep semantic information of the commodity title and deep semantic information of the commodity picture by adopting a multi-mode feature interaction fusion module to obtain preliminary fusion feature information, wherein the preliminary fusion feature information obviously represents the features of the image of the target commodity; and the information combining sub-module is used for combining the preliminary fusion characteristic information with the deep semantic information of the commodity picture to obtain image-text fusion characteristic information.

In a preferred embodiment, the semantic fusion sub-module includes: a vector input unit for constructing a query vector with the deep semantic information of the commodity picture, constructing a key vector and a value vector with the deep semantic information of the commodity title, and inputting them into an attention layer; the weight extraction unit is used for interacting and normalizing the query vector and the key vector by the attention layer to obtain a weight matrix; and the feature generation unit is used for matching the value vector with the weight matrix by the attention layer to obtain preliminary fusion feature information.

In a further embodiment, the object recognition module includes: the target detection unit is used for detecting the target commodity in the commodity picture according to the image-text fusion characteristic information by adopting a target detection model which is trained to be convergent in advance to obtain a corresponding detection area; and the frame selection identification unit is used for solving a rectangular frame with the minimum area surrounding the detection area, and selecting the target commodity as an identification result by using the rectangular frame.

In an extended embodiment, the device further includes, after the target identification module: the intercepting storage module is used for intercepting an image of the target commodity from the commodity picture according to the rectangular frame of the target commodity selected by the frame and storing the unique identification code of the target commodity associated with the image in a commodity database; the response request module is used for responding to the commodity recommendation request, retrieving the commodity database according to the unique identification code of the target commodity to acquire the image of the target commodity, and matching the recommended commodity similar to the image; and the response request module is used for responding to the commodity recommendation request and pushing the recommended commodity.

The technical solution of the present application has various advantages, including but not limited to the following aspects:

Firstly, the method highlights the image features of the target commodity in the deep semantic information of the commodity picture by using the deep semantic information of the commodity title, which provides key information indicating the target commodity in the commodity picture, so that a single object can be identified according to the key information and the target commodity can be quickly and accurately identified from the commodity picture.

Secondly, multi-modal feature fusion is adopted: the deep semantic information of the commodity title and the deep semantic information of the commodity picture are fused, and the target commodity can then be recognized according to the fused features, which simplifies both the model architecture used for recognition and its training steps.

In addition, the target commodity identification function realized by the present application can be applied to various downstream tasks in the e-commerce platform, such as commodity similarity matching, commodity classification and commodity labeling. Because the method extracts the image of the target commodity in a targeted manner, it helps provide downstream tasks with exactly the image they require.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic flow chart diagram of an exemplary embodiment of a target commodity identification method of the present application;

FIG. 2 is a schematic diagram of a multi-modal feature interaction module in an embodiment of the present application;

FIG. 3 is a schematic diagram of an implementation architecture of a target product identification model used in an embodiment of the present application;

FIG. 4 is a schematic diagram illustrating a process of extracting deep semantic information according to an embodiment of the present application;

FIG. 5 is a flow chart illustrating a process of pre-processing a title of an article according to an embodiment of the present application;

FIG. 6 is a schematic flow chart of obtaining image-text fusion characteristic information in the embodiment of the present application;

FIG. 7 is a schematic flow chart of obtaining preliminary image-text fusion feature information in the embodiment of the present application;

FIG. 8 is a schematic flowchart illustrating a process of identifying a target product from a product picture according to an embodiment of the present application;

FIG. 9 is a schematic flow chart of commodity recommendation in an embodiment of the present application;

FIG. 10 is a functional block diagram of a target item identification device of the present application;

FIG. 11 is a schematic structural diagram of a computer device used in the present application.

Detailed description of the invention

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.

It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

As will be appreciated by those skilled in the art, "client", "terminal" and "terminal device" as used herein include both devices that have only a wireless signal receiver with no transmit capability and devices that have receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device, such as a personal computer or tablet, with a single-line display, a multi-line display, or no multi-line display; a PCS (Personal Communications Service) device, which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client" or "terminal device" can be portable, transportable, installed in a vehicle (aeronautical, maritime and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location on earth and/or in space. A "client" or "terminal device" as used herein may also be a communication terminal, a web terminal or a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a mobile phone with a music/video playing function, and may also be a smart TV, a set-top box and the like.

The hardware referred to by the names "server", "client", "service node" and the like is essentially an electronic device with the capabilities of a personal computer, that is, a hardware device having the necessary components disclosed by the von Neumann principle, such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device and an output device. A computer program is stored in the memory, and the central processing unit loads the program from external memory into internal memory, runs it, executes its instructions and interacts with the input and output devices, thereby completing a specific function.

It should be noted that the concept of "server" in the present application can be extended to the case of server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through an interface, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the implementation of the network deployment of the present application.

One or more technical features of the present application, unless expressly specified otherwise, may be deployed on a server and accessed by a client remotely invoking an online service interface provided by the server, or may be deployed and run directly on the client.

Unless expressly specified otherwise, any neural network model referred to, or possibly referred to, in the present application may be deployed on a remote server and called remotely by a client, or may be deployed on a client with sufficient device capability and called directly.

Unless expressly specified otherwise, the various data referred to in the present application may be stored remotely on a server or locally on a terminal device, as long as the data can be called by the technical solution of the present application.

The person skilled in the art will know this: although the various methods of the present application are described based on the same concept so as to be common to each other, they may be independently performed unless otherwise specified. In the same way, for each embodiment disclosed in the present application, it is proposed based on the same inventive concept, and therefore, concepts of the same expression and concepts of which expressions are different but are appropriately changed only for convenience should be equally understood.

The embodiments to be disclosed herein can be flexibly constructed by cross-linking related technical features of the embodiments unless the mutual exclusion relationship between the related technical features is stated in the clear text, as long as the combination does not depart from the inventive spirit of the present application and can meet the needs of the prior art or solve the deficiencies of the prior art. Those skilled in the art will appreciate variations therefrom.

The target commodity identification method of the present application can be programmed as a computer program product and deployed to run on a server. For example, in the e-commerce platform application scenario of the present application, the method is generally deployed on a server, so that once the computer program product is running, the method can be executed by accessing its open interface and interacting with its process through a graphical user interface.

The target recognition model is an integrated model and comprises a neural network model for extracting deep semantic information corresponding to a commodity picture, a neural network model for extracting deep semantic information corresponding to a commodity title and a neural network model for recognizing a target commodity in a commodity image.

Referring to fig. 1, in an exemplary embodiment of a target product identification method of the present application, the method includes the following steps:

step S1100, acquiring a commodity title and a commodity picture in the commodity information of the target commodity;

In an application scenario of the e-commerce platform, each commodity can be treated as a relatively independent unit of information, which a merchant user of an online shop on the e-commerce platform is responsible for publishing, maintaining and updating, and which consumer users can browse, order and so on. The online shop may be an independent site that maintains its own commodity database, and target commodities in commodity pictures can be identified by installing a computer program product implementing the present application. Each commodity has corresponding commodity information describing it, and the commodity information usually includes a commodity title and a commodity picture.

The target commodity is an on-shelf commodity, and a merchant user of an online shop of the e-commerce platform is responsible for on-shelf publishing and selling of the target commodity.

The commodity picture is usually used to display the target commodity and contains the target commodity and, often, other matching articles that set it off. For example, when the target commodity is a skirt, a commodity picture may show a model wearing the skirt together with matching shoes, clothing, jewelry and a coat to display the effect of the skirt; when the target commodity is a commodity shelf, a commodity picture may show the commodity shelf together with matching books, household appliances and ornaments to display the effect of the commodity shelf. That is, the commodity picture may contain content other than the target commodity, and that content may include commodities other than the current target commodity.

The commodity title is stored in association with the target commodity and is commodity description information provided in text form. In practice, the commodity title concisely and accurately describes specific information about the target commodity, such as its name, brand, material, function, application and selling points.

In one embodiment, the corresponding commodity information can be acquired from the commodity database of an online shop according to the unique identification code of the target commodity. The unique identification code is a unique identifier set by software engineers to distinguish each target commodity on the e-commerce platform, which makes the commodity information of the target commodity convenient to store and retrieve.

When a merchant user of an online shop needs to list a target commodity, the user enters the commodity information of the target commodity on the commodity publishing page of the e-commerce platform and submits it to the background server, which stores the commodity information in the commodity database in association with the unique identification code of the target commodity.

Step S1200, extracting deep semantic information of the commodity picture and the commodity title;

Image feature extraction may be performed on the commodity picture using any of various image feature extraction models trained to convergence in advance, to extract the deep semantic information corresponding to the visual features of the target commodity and the other matching articles in the commodity picture. The image feature extraction model is generally a CNN-based neural network model suitable for extracting deep semantic features from pictures, such as ResNet or EfficientNet, and can be flexibly selected by a person skilled in the art.

Likewise, text feature extraction may be performed on the commodity title using any of various text feature extraction models trained to convergence in advance, to extract the deep semantic information corresponding to the text features that characterize the target commodity in the commodity title. The text feature extraction model is generally a neural network model suitable for extracting deep semantic features from text, such as BERT, LSTM or ELECTRA, which is implemented based on an RNN or a Transformer, and can be flexibly selected by those skilled in the art.

Step S1300, fusing the deep semantic information of the commodity title to the deep semantic information of the commodity picture so as to highlight the image characteristics of the target commodity in the deep semantic information of the commodity picture according to the deep semantic information of the commodity title and obtain image-text fusion characteristic information;

In one embodiment, feature interaction corresponding to a self-attention mechanism is performed on the deep semantic information of the commodity title and the deep semantic information of the commodity picture in an attention layer 200 as shown in FIG. 2, so that the two interact deeply at the feature level. The commodity picture and the commodity title are thereby deeply fused at the deep semantic level, and the attention layer outputs preliminary image-text fusion feature information. Because the deep semantic information of the commodity title is fused into the deep semantic information of the commodity picture, and the text semantics of the commodity title indicate the target commodity, the preliminary image-text fusion feature information prominently represents the features of the image of the target commodity. The feature interaction corresponding to this self-attention mechanism is further disclosed in the following embodiments and is not detailed here. Further, referring to FIG. 3, the preliminary fusion feature information is combined with the deep semantic information 300 of the commodity picture to obtain the image-text fusion feature information; the combination may be performed by matrix addition. Because the preliminary fusion feature information prominently represents the features of the image of the target commodity, the image-text fusion feature information can be used to identify the target commodity.
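The combination by matrix addition mentioned above amounts to an element-wise sum of two feature matrices of identical shape. The sketch below is illustrative only; the function name and the list-of-rows representation are assumptions.

```python
def combine_by_addition(preliminary_fused, image_feats):
    """Element-wise sum of the preliminary fusion features and the picture's
    deep semantic features; both must have the same shape."""
    if len(preliminary_fused) != len(image_feats) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(preliminary_fused, image_feats)
    ):
        raise ValueError("feature matrices must have the same shape")
    return [
        [a + b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(preliminary_fused, image_feats)
    ]


fusion = combine_by_addition([[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [1.0, -1.0]])
```

Adding the original picture features back in keeps the full visual information available to the detector while the fused term emphasizes the regions indicated by the title.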

Step S1400, inputting the image-text fusion feature information into a target detection model trained to convergence in advance, and identifying the target commodity.

The target commodity in the commodity picture is detected in order to identify it, which may be implemented with a target detection model trained to convergence in advance. The target detection model is generally a deep-learning-based model, such as the RCNN series, the YOLO series, or the SSD (Single Shot MultiBox Detector) series. The RCNN series is representative of two-stage algorithms based on region proposals, YOLO is representative of single-stage algorithms based on regression, and SSD is an algorithm that builds on ideas from the first two series.

The RCNN series generally includes specific models such as R-CNN, SPPNet, Fast R-CNN and Faster R-CNN, and any of the multiple versions of the YOLO series may be used. All such target detection models are suitable for identifying a target image region in a given picture, so that the corresponding target image can be obtained according to that region.

In one embodiment, YOLOv5 may be used as the target detection model, with a classifier attached, and the model is fine-tuned with a sufficient number of training samples. The training samples are commodity pictures that contain target commodities together with other matching articles, and each training sample provides a corresponding training label for each target commodity, so as to supervise the training and enable the model to learn to accurately identify the region corresponding to the target commodity in a given commodity picture.

Therefore, the image-text fusion feature information is input into the target detection model trained to convergence in advance. Because the image-text fusion feature information incorporates the semantic information of the commodity title, the image region corresponding to the target commodity can be detected from the commodity picture under the guidance of that semantic information, and the coordinate information of the target commodity in the commodity picture is output. Further, according to the coordinate information output by the target detection model, the image of the target commodity is cropped from the commodity picture, whereby the target commodity is identified from the commodity picture.
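The cropping step described above can be sketched as follows. This is a minimal illustration only: the function name `crop_region`, the `(x_min, y_min, x_max, y_max)` box layout with exclusive upper bounds, and the nested-list picture representation are assumptions made for the example and are not specified by the application.

```python
def crop_region(picture, box):
    """Cut out the sub-image described by box = (x_min, y_min, x_max, y_max).

    `picture` is a list of pixel rows; the upper bounds are exclusive.
    Both conventions are assumptions made for this sketch.
    """
    x_min, y_min, x_max, y_max = box
    height = len(picture)
    width = len(picture[0]) if height else 0
    # Clamp the box to the picture so that out-of-range coordinates
    # produced by a detector cannot cause an index error.
    x_min, x_max = max(0, x_min), min(width, x_max)
    y_min, y_max = max(0, y_min), min(height, y_max)
    return [row[x_min:x_max] for row in picture[y_min:y_max]]


# A 4 x 5 toy "picture" whose pixel value encodes its row and column.
picture = [[10 * r + c for c in range(5)] for r in range(4)]
target_image = crop_region(picture, (1, 1, 4, 3))
```

In a real pipeline the box would come from the detection model's output rather than being written by hand.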

In light of the exemplary embodiments disclosed herein, it can be seen that the present application has various advantages, including at least:

Referring to fig. 4, in a further embodiment, in the step S1200, the step of extracting deep semantic information of the product picture and the product title includes the following steps:

Step S1210, preprocessing the commodity picture, inputting the preprocessed commodity picture into an image feature extraction model trained to convergence in advance, and obtaining corresponding deep semantic information representing the image features of the commodity picture;

To make it easier for the subsequent model to extract the image features of the commodity picture, the commodity picture is preprocessed: its length and width are scaled by the same ratio, which a person skilled in the art may set flexibly according to prior knowledge or experimental data.
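Scaling both sides by the same ratio can be illustrated with the sketch below. The helper names `scaled_size` and `resize_nearest`, and the use of nearest-neighbour sampling, are assumptions for illustration; the application does not state which interpolation is used.

```python
def scaled_size(width, height, ratio):
    """Return the (width, height) obtained by scaling both sides by `ratio`."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return max(1, round(width * ratio)), max(1, round(height * ratio))


def resize_nearest(picture, ratio):
    """Resize a picture (list of pixel rows) with nearest-neighbour sampling."""
    src_h, src_w = len(picture), len(picture[0])
    dst_w, dst_h = scaled_size(src_w, src_h, ratio)
    return [
        [
            # Map each destination pixel back to its nearest source pixel.
            picture[min(src_h - 1, int(y / ratio))][min(src_w - 1, int(x / ratio))]
            for x in range(dst_w)
        ]
        for y in range(dst_h)
    ]


enlarged = resize_nearest([[1, 2], [3, 4]], 2)
```

Because the same ratio is applied to both dimensions, the aspect ratio of the commodity picture is preserved.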

In one embodiment, the image feature extraction model is ResNet50. The preprocessed commodity picture is input into a ResNet50 trained to convergence in advance, whose stem block and four residual stages (built from bottleneck blocks) extract the image features of the commodity picture step by step. The shallow stages extract basic features, such as details and edges, of the target commodity and of the other matching articles in the commodity picture; the deeper stages further extract deep semantic features and high-level logical features; and the deep semantic information output by the final Res5 stage is obtained, as shown at 300 in fig. 3.

Step S1220, preprocessing the product title, inputting the preprocessed product title to a text feature extraction model trained to converge in advance, and obtaining corresponding deep semantic information for representing the text feature of the product title.

Generally speaking, the text format of a commodity title is relatively messy and may include line feed characters, redundant punctuation marks, redundant blank characters and the like. These characters have little influence on the semantics of the commodity title itself but may interfere with the accuracy of subsequent semantic extraction. Therefore, to improve the accuracy with which the subsequent model extracts deep semantic information, format preprocessing may be performed on the commodity title, for example: replacing line feed characters with a space character; replacing two or more consecutive blank characters with a single blank character; replacing two or more consecutive punctuation marks with a single retained one; and so on. The format preprocessing is applied as required, and a person skilled in the art may implement it flexibly according to the actual service conditions.
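The three example rules above can be sketched with regular expressions. The function name `normalize_title` and the particular set of punctuation marks are assumptions for illustration, and the sketch only collapses runs of the same punctuation mark, which is one possible reading of the third rule.

```python
import re


def normalize_title(title):
    """Apply simple format preprocessing to a commodity title."""
    # Rule 1: replace line feed characters with a space.
    text = re.sub(r"[\r\n]+", " ", title)
    # Rule 2: collapse any run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    # Rule 3: collapse a repeated punctuation mark into one occurrence.
    text = re.sub(r"([!?,.;:~\-*#])\1+", r"\1", text)
    return text.strip()


cleaned = normalize_title("Elegant  lady\nskirt!!!   New,,")
```

The order matters slightly: line feeds are turned into spaces first so that the whitespace rule can then merge them with any neighbouring blanks.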

In one embodiment, the text feature extraction model is BERT. The preprocessed commodity title is input into a BERT model trained to convergence in advance, which extracts the text features of the text that characterizes the commodity type of the target commodity in the commodity title, thereby obtaining the corresponding deep semantic information. For example, if the commodity title is "handmade spiked-on-breast-decorated elegant lady skirt", the text characterizing the commodity type of the target commodity is "women's skirt".

In this embodiment, the deep semantic information of the commodity picture and of the commodity title is extracted quickly and accurately by the image feature extraction model and the text feature extraction model, each trained to convergence in advance.

Referring to fig. 5, in the embodiment, the step S1220 of preprocessing the product title includes the following steps:

Step S1221, filtering invalid characters in the commodity title;

Generally, the title of the target commodity includes text naming the commodity category of the target commodity as well as modifier text describing its effect, function, texture, material and the like. It can be understood that, for the target commodity identification of the present application, the modifier text constitutes invalid characters, so the invalid characters in the commodity title can be filtered out to make it easier for the subsequent model to extract the corresponding text features. In one embodiment, modifier text can be collected in advance, manually or by means of artificial intelligence, and compiled into an invalid character dictionary. The text of the commodity title is then matched exactly and/or fuzzily against the entries of the invalid character dictionary, the invalid characters in the commodity title are determined according to the matching result, and those invalid characters are deleted to filter the commodity title.
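A dictionary-based filter with exact and fuzzy matching might look like the sketch below. The dictionary contents, the function name `filter_title`, the word-level granularity and the cutoff value are all hypothetical; fuzzy matching is done here with the standard-library `difflib.get_close_matches`, which is only one of many possible choices.

```python
import difflib

# Hypothetical invalid-character dictionary of modifier words.
INVALID_WORDS = ["handmade", "elegant", "breathable", "cotton"]


def filter_title(title, invalid_words=INVALID_WORDS, fuzzy_cutoff=0.8):
    """Drop words that match the dictionary exactly or approximately."""
    kept = []
    for word in title.split():
        key = word.lower()
        # Exact match against the dictionary.
        if key in invalid_words:
            continue
        # Fuzzy match: tolerate small spelling differences.
        if difflib.get_close_matches(key, invalid_words, n=1, cutoff=fuzzy_cutoff):
            continue
        kept.append(word)
    return " ".join(kept)


filtered = filter_title("Handmade elegent lady skirt")
```

Here the misspelled "elegent" is removed by the fuzzy rule, while the category words "lady skirt" are retained.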

Step S1222, performing word segmentation on the filtered product titles to obtain keywords therein, where the keywords include product words and/or brand words of the target product, and completing the preprocessing of the product titles.

In one embodiment, the filtered commodity title is segmented with a word segmenter composed of a basic tokenizer (BasicTokenizer) and a WordPiece tokenizer (WordpieceTokenizer). The basic tokenizer first produces a relatively coarse-grained token list, and the WordPiece tokenizer is then applied once to each token, so that the keywords are obtained and the preprocessing of the commodity title is completed.
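The two-stage segmentation can be illustrated with a simplified re-implementation. This sketch is not the tokenizer shipped with any BERT library: the toy vocabulary `VOCAB` and all function names are assumptions, and only the greedy longest-match-first idea of WordPiece is reproduced.

```python
import re

# Toy vocabulary; a real BERT vocabulary contains tens of thousands of entries.
VOCAB = {"lady", "skirt", "dress", "##s", ",", "[UNK]"}


def basic_tokenize(text):
    """Coarse split: lower-case, then separate words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def wordpiece_tokenize(token, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one token into sub-word pieces."""
    pieces, start = [], 0
    while start < len(token):
        end, piece = len(token), None
        while start < end:
            candidate = token[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole token is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab=VOCAB):
    return [p for tok in basic_tokenize(text) for p in wordpiece_tokenize(tok, vocab)]


tokens = tokenize("Lady skirts, dress")
```

The coarse token "skirts" is split into "skirt" and the continuation piece "##s", which shows how the second stage refines the output of the first.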

In the embodiment, by filtering the invalid characters in the commodity title, when the subsequent model extracts corresponding text features, the texts needing to be processed are fewer, the interference is also smaller, and the execution efficiency and the accuracy of the model can be improved.

Referring to fig. 6, in a further embodiment, in step S1300, the step of fusing the deep semantic information of the product title to the deep semantic information of the product picture to highlight the image feature of the target product in the deep semantic information of the product picture according to the deep semantic information of the product title, and obtaining the image-text fusion feature information includes the following steps:

Step S1310, fusing the deep semantic information of the commodity title and the deep semantic information of the commodity picture by using a multi-modal feature interactive fusion module to obtain preliminary fusion feature information, wherein the preliminary fusion feature information prominently represents the features of the image of the target commodity;

As shown in fig. 2, specifically, in the attention layer 200, two identical convolution layers extract two corresponding pieces of feature information from the deep semantic information of the commodity title, and two identical convolution layers extract two corresponding pieces of feature information from the deep semantic information of the commodity picture. Further, one piece of the feature information extracted from the deep semantic information of the commodity picture and the two pieces of feature information extracted from the deep semantic information of the commodity title undergo the feature interaction corresponding to a self-attention mechanism, so that the deep semantic information of the commodity title and the deep semantic information of the commodity picture interact deeply at the feature level. The commodity picture and the commodity title are thereby deeply fused at the deep semantic level, and the preliminary image-text fusion feature information output by the attention layer is obtained. It can be understood that, because of this deep fusion, the deep semantic information of the commodity title is fused into the deep semantic information of the commodity picture; and because the commodity title indicates the target commodity in its text semantics, the preliminary image-text fusion feature information prominently represents the features of the image of the target commodity by reference to the text semantics of the commodity title. The feature interaction corresponding to this self-attention mechanism is disclosed further in the following embodiments and is not elaborated here.

In an optional embodiment, referring to fig. 2, a convolution layer may further be used to extract feature information 201 from the preliminary image-text fusion feature information, and element-wise (dot) multiplication may be performed between the feature information 201 and the other piece of feature information 202 extracted from the deep semantic information of the commodity picture. It is easy to understand that, after this multiplication, the features in the preliminary image-text fusion feature information that represent the image of the target commodity in the commodity picture become more salient, and the salience-enhanced preliminary image-text fusion feature information is obtained. The element-wise multiplication takes two matrices of the same dimensions and multiplies their feature data position by position: the feature data in the first row and first column of the matrix corresponding to the feature information 201 is multiplied by the feature data in the first row and first column of the matrix corresponding to the feature information 202, the feature data in the first row and second column of the former is multiplied by the feature data in the first row and second column of the latter, the feature data in the second row and first column of the former is multiplied by the feature data in the second row and first column of the latter, and so on.
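The position-by-position multiplication described above can be sketched in a few lines. The function name `elementwise_multiply` and the small matrices standing in for feature information 201 and 202 are illustrative assumptions.

```python
def elementwise_multiply(a, b):
    """Multiply two same-sized matrices position by position."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same dimensions")
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


feature_201 = [[1, 2], [3, 4]]  # stand-in for feature information 201
feature_202 = [[5, 6], [7, 8]]  # stand-in for feature information 202
enhanced = elementwise_multiply(feature_201, feature_202)
```

Unlike an ordinary matrix product, the result has the same dimensions as its inputs, and each output entry depends on exactly one entry from each input.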

Step S1320, combining the preliminary fusion characteristic information with the deep semantic information of the commodity picture to obtain image-text fusion characteristic information;

The preliminary fusion feature information is combined with the deep semantic information of the commodity picture, shown at 300 in fig. 3, to obtain the image-text fusion feature information; the combination may adopt matrix addition. Because the preliminary fusion feature information prominently represents the features of the image of the target commodity, the image-text fusion feature information can be used to identify the image of the target commodity.

In this embodiment, the feature interaction corresponding to the self-attention mechanism makes the features of the image of the target commodity in the deep semantic information of the commodity picture more prominent, so that the features representing the image of the target commodity become the salient features of the image-text fusion feature information, which helps improve the accuracy with which the subsequent model identifies the target commodity in the commodity picture.

Referring to fig. 7, in the preferred embodiment, in the step S1310, the step of fusing the deep semantic information of the product title and the deep semantic information of the product picture by using the multi-modal feature interactive fusion module to obtain the preliminary fusion feature information includes the following steps:

Step S1311, constructing a query vector from the deep semantic information of the commodity picture, constructing a key vector and a value vector from the deep semantic information of the commodity title, and inputting them into the attention layer;

Referring to fig. 2, in the attention layer (Attention) 200, the deep semantic information of the commodity picture and the deep semantic information of the commodity title are used as the inputs of the attention layer. The corresponding convolution layer, namely the weight matrix $W^Q$, extracts features from the deep semantic information of the commodity picture to obtain the corresponding query vector (Query); the two corresponding convolution layers, namely the weight matrices $W^K$ and $W^V$, extract features from the deep semantic information of the commodity title to obtain the corresponding key vector (Key) and value vector (Value). The weight matrices $W^Q$, $W^K$ and $W^V$ may all be learned weights.

Step S1312, the attention layer interacts and normalizes the query vector and the key vector to obtain a weight matrix;

Continuing to refer to fig. 2, in the attention layer 200, matrix multiplication is performed between the query vector and the transposed matrix corresponding to the key vector to obtain a product matrix of size $HW \times T$, which realizes the feature interaction between the deep semantic information of the commodity picture and the deep semantic information of the commodity title. The product matrix is then activated by a Softmax function to output the weight matrix. The weight matrix is the semantic information obtained after the deep semantic information of the commodity picture and the deep semantic information of the commodity title have interacted deeply; in essence, it is a weighting result that, guided by the deep semantic information of the commodity title, highlights the salient features in the deep semantic information of the commodity picture, namely the features of the image of the target commodity in the commodity picture.

Step S1313, the attention layer multiplies the weight matrix by the value vector to obtain the preliminary fusion feature information;

Continuing with fig. 2, in the attention layer 200, the weight matrix of size $HW \times T$ output after activation by the Softmax function is matrix-multiplied with the transposed matrix corresponding to the value vector, that is, the text features of size $T \times C_i$, to obtain a product matrix of size $HW \times C_i$. This product matrix is the preliminary fusion feature information obtained by performing the feature interaction corresponding to the self-attention mechanism on the deep semantic information of the commodity picture and the deep semantic information of the commodity title.

In this embodiment, the preliminary fusion feature information is obtained by multiplying the weight matrix, which results from the interaction between the deep semantic information of the commodity picture and that of the commodity title, by the deep semantic information of the commodity title projected with the corresponding weight $W^V$. The deep semantic information of the commodity title is thereby deeply fused with the deep semantic information of the commodity picture once more, so that the preliminary fusion feature information prominently represents the features of the image of the target commodity.
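Steps S1311 to S1313, followed by the matrix addition of step S1320, can be sketched in plain Python. Several simplifications are assumed here and are not taken from the application: the convolution layers are modelled as plain weight matrices applied per position, the weights are identity matrices rather than learned values, no scaling factor is applied before Softmax, and the picture features are assumed to have the same channel count as the attention output so that they can be added.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Normalize each row so that its entries are positive and sum to 1."""
    out = []
    for row in m:
        peak = max(row)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(picture_feats, title_feats, w_q, w_k, w_v):
    query = matmul(picture_feats, w_q)  # HW x C_i, from the picture
    key = matmul(title_feats, w_k)      # T x C_i, from the title
    value = matmul(title_feats, w_v)    # T x C_i, from the title
    weights = softmax_rows(matmul(query, transpose(key)))  # HW x T
    return matmul(weights, value), weights                 # HW x C_i


identity = [[1.0, 0.0], [0.0, 1.0]]                  # placeholder weights
picture_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # HW = 3 positions
title_feats = [[2.0, 0.0], [0.0, 2.0]]                # T = 2 tokens

preliminary, weights = cross_attention(
    picture_feats, title_feats, identity, identity, identity
)
# Step S1320: combine with the picture features by matrix addition.
fused = [[p + f for p, f in zip(pr, fr)] for pr, fr in zip(preliminary, picture_feats)]
```

Each row of `weights` describes how strongly one picture position attends to each title token, which is how the title steers the picture features toward the target commodity.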

Referring to fig. 8, in a further embodiment, the step S1400 of inputting the image-text fusion feature information into a pre-trained to converged target detection model and identifying the target commodity includes the following steps:

Step S1410, detecting the target commodity in the commodity picture according to the image-text fusion feature information by using a target detection model trained to convergence in advance, and obtaining a corresponding detection area;

In one embodiment, the target detection model is Mask R-CNN. The image-text fusion feature information is input into a Mask R-CNN trained to convergence in advance, and the detection area corresponding to the image of the target commodity is detected from the commodity picture.

Step S1420, obtaining the minimum-area rectangular frame enclosing the detection area, and framing the target commodity with it as the identification result.

The minimum-area rectangular frame enclosing the detection area is obtained, so that the rectangular frame completely contains the image of the target commodity in the detection area while the area it contains that does not belong to the target commodity is as small as possible. The rectangular frame and its position information in the commodity picture are thereby obtained, the position information being the coordinates of the four vertices of the rectangular frame. Further, the image of the target commodity in the commodity picture is framed by the rectangular frame as the identification result.
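Computing an enclosing rectangle from a detection area can be sketched as follows. The sketch assumes that the detection area is given as a binary mask and that the rectangle is axis-aligned; the application does not state either point, and the function name `min_bounding_rect` is hypothetical.

```python
def min_bounding_rect(mask):
    """Return the four vertices of the tightest axis-aligned rectangle
    around the non-zero cells of a binary mask, or None if it is empty."""
    if not mask or not mask[0]:
        return None
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows:
        return None
    x_min, x_max, y_min, y_max = cols[0], cols[-1], rows[0], rows[-1]
    # Vertices in clockwise order starting from the top-left corner.
    return [(x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)]


detection_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
vertices = min_bounding_rect(detection_mask)
```

The four returned vertices correspond to the position information mentioned above, expressed as (x, y) pixel coordinates.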

In this embodiment, the rectangular frame with the minimum area surrounding the detection area is obtained, and the target commodity is selected as the recognition result, so that the accuracy of the recognition result is improved.

Referring to fig. 9, in the expanded embodiment, after the step of inputting the image-text fusion feature information into the pre-trained to-converged target detection model and identifying the target commodity in step S1400, the method further includes the following steps:

Step S1500, cropping an image of the target commodity from the commodity picture according to the rectangular frame enclosing the target commodity, and storing the image in a commodity database in association with the unique identification code of the target commodity;

An image of the target commodity is cropped from the commodity picture according to the position information, in the commodity picture, of the rectangular frame enclosing the target commodity, and the image is stored in the commodity database in association with the unique identification code of the target commodity for subsequent retrieval.

Step S1600, responding to the commodity recommendation request, retrieving the commodity database according to the unique identification code of the target commodity to obtain the image of the target commodity, and matching the recommended commodity similar to the image;

It can be understood that some e-commerce pages of the e-commerce platform need to load recommended commodities, which triggers a commodity recommendation request that is pushed to a server of the e-commerce platform. The server receives and responds to the request, retrieves the commodity database according to the unique identification code of the target commodity to obtain the image of the target commodity, and performs picture similarity matching between that image and the target commodity images of the commodities in the commodity database that belong to the same commodity category as the target commodity. Commodities whose matching similarity exceeds a threshold are taken as recommended commodities, and the threshold may be set by a person skilled in the art according to business needs. The commodity categories are generally defined by the e-commerce platform, and a merchant user of an online store generally has to select the commodity category of a commodity when publishing it, so that every commodity on the e-commerce platform has a corresponding commodity category.
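The retrieval and matching logic can be sketched as below. The in-memory dictionary standing in for the commodity database, the field names, the use of cosine similarity over image feature vectors, and the threshold value are all assumptions for illustration; the application does not specify how picture similarity is computed.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(target_id, database, threshold=0.9):
    """Return ids of same-category commodities whose image is similar enough."""
    target = database[target_id]  # look up by unique identification code
    scored = []
    for item_id, item in database.items():
        # Skip the target itself and commodities of another category.
        if item_id == target_id or item["category"] != target["category"]:
            continue
        score = cosine_similarity(target["image_vec"], item["image_vec"])
        if score > threshold:
            scored.append((item_id, score))
    # Most similar commodities first.
    return [item_id for item_id, _ in sorted(scored, key=lambda p: p[1], reverse=True)]


# Hypothetical database keyed by unique identification code.
database = {
    "sku1": {"category": "skirt", "image_vec": [1.0, 0.0]},
    "sku2": {"category": "skirt", "image_vec": [0.9, 0.1]},
    "sku3": {"category": "skirt", "image_vec": [0.0, 1.0]},
    "sku4": {"category": "hat", "image_vec": [1.0, 0.0]},
}
recommended = recommend("sku1", database)
```

Restricting the comparison to the same commodity category keeps "sku4" out of the result even though its vector is identical to that of the target.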

Step S1700, responding to the commodity recommendation request and pushing the recommended commodity.

Further, responding to the commodity recommendation request, pushing the recommended commodity to a corresponding e-commerce page, and receiving the recommended commodity by the e-commerce page and loading and displaying the recommended commodity.

In this embodiment, since the images of the target commodities corresponding to the commodity images of the commodities in the e-commerce platform are captured, the commodity recommendation accuracy is guaranteed, and accurate recommendation is realized.

Referring to fig. 10, a target product identification apparatus adapted to one of the objectives of the present application is a functional implementation of the target product identification method of the present application, and the apparatus includes: the system comprises an image-text acquisition module 1100, a semantic extraction module 1200, a feature fusion module 1300 and a target identification module 1400, wherein the image-text acquisition module 1100 is used for acquiring a commodity title and a commodity picture in commodity information of a target commodity; a semantic extraction module 1200, configured to extract deep semantic information of the commodity picture and the commodity title; the feature fusion module 1300 is configured to fuse the deep semantic information of the commodity title to the deep semantic information of the commodity picture, so as to highlight the image feature of the target commodity in the deep semantic information of the commodity picture according to the deep semantic information of the commodity title, and obtain image-text fusion feature information; and the target identification module 1400 is configured to input the image-text fusion characteristic information to a target detection model trained to converge in advance, and identify the target commodity.

In a further embodiment, the semantic extraction module 1200 includes: the image feature extraction submodule is used for preprocessing the commodity picture, inputting the preprocessed commodity picture into an image feature extraction model which is trained to be convergent in advance, obtaining corresponding deep semantic information and representing the image features of the commodity picture; and the text feature extraction submodule is used for preprocessing the commodity title, inputting the preprocessed commodity title into a text feature extraction model which is trained to be convergent in advance, and obtaining corresponding deep semantic information used for representing the text feature of the commodity title.

In a further embodiment, the feature fusion module 1300 includes: the semantic fusion submodule, which is used for fusing the deep semantic information of the commodity title and the deep semantic information of the commodity picture by using a multi-modal feature interactive fusion module to obtain preliminary fusion feature information, wherein the preliminary fusion feature information prominently represents the features of the image of the target commodity; and the information combining submodule, which is used for combining the preliminary fusion feature information with the deep semantic information of the commodity picture to obtain the image-text fusion feature information.

In a further embodiment, the target recognition module 1400 includes: the target detection unit is used for detecting the target commodity in the commodity picture according to the image-text fusion characteristic information by adopting a target detection model which is trained to be convergent in advance to obtain a corresponding detection area; and the frame selection identification unit is used for solving a rectangular frame with the minimum area surrounding the detection area, and selecting the target commodity as an identification result by using the rectangular frame.

In an extended embodiment, following the target identification module 1400, the apparatus further includes: the intercepting storage module, which is used for cropping an image of the target commodity from the commodity picture according to the rectangular frame enclosing the target commodity and storing the image in a commodity database in association with the unique identification code of the target commodity; the response request module, which is used for responding to the commodity recommendation request, retrieving the commodity database according to the unique identification code of the target commodity to acquire the image of the target commodity, and matching recommended commodities similar to the image; and the response request module, which is used for responding to the commodity recommendation request and pushing the recommended commodity.

In order to solve the technical problem, an embodiment of the present application further provides a computer device, the internal structure of which is schematically illustrated in fig. 11. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer-readable storage medium of the computer device stores an operating system, a database and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, can cause the processor to implement a target commodity identification method. The processor of the computer device provides computing and control capability and supports the operation of the whole computer device. The memory of the computer device may store computer-readable instructions which, when executed by the processor, may cause the processor to perform the target commodity identification method of the present application. The network interface of the computer device is used for connecting to and communicating with a terminal. Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply, as a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In this embodiment, the processor is configured to execute specific functions of each module and its sub-module in fig. 10, and the memory stores program codes and various data required for executing the modules or sub-modules. The network interface is used for data transmission to and from a user terminal or a server. The memory in the present embodiment stores program codes and data necessary for executing all modules and submodules in the target product identification device of the present application, and the server can call the program codes and data of the server to execute the functions of all the submodules.

The present application also provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the target item identification method of any of the embodiments of the present application.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).

Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted.

The foregoing is only a partial embodiment of the present application, and it should be noted that, for those skilled in the art, several modifications and decorations can be made without departing from the principle of the present application, and these modifications and decorations should also be regarded as the protection scope of the present application.

Claims

1. A target commodity identification method is characterized by comprising the following steps:

2. The method for identifying the target commodity according to claim 1, wherein the step of extracting the deep semantic information of the commodity picture and the commodity title comprises the following steps:

3. The method for identifying a target product according to claim 2, wherein the step of preprocessing the product title comprises the steps of:

filtering invalid characters in the commodity title;

4. The method for identifying the target commodity according to claim 1, wherein the step of fusing the deep semantic information of the commodity title to the deep semantic information of the commodity picture to highlight the image feature of the target commodity in the deep semantic information of the commodity picture according to the deep semantic information of the commodity title, and obtaining the image-text fusion feature information comprises the following steps:

5. The target commodity identification method according to claim 3, wherein the step of fusing the deep semantic information of the commodity title and the deep semantic information of the commodity picture by using a multi-modal feature interactive fusion module to obtain preliminary fusion feature information comprises the steps of:

6. The method for identifying the target commodity according to claim 1, wherein the step of inputting the image-text fusion feature information into a target detection model trained to converge in advance to identify the target commodity comprises the following steps:

7. The method for identifying a target commodity according to claim 1, wherein the step of inputting the image-text fusion characteristic information into a target detection model trained to converge in advance and identifying the target commodity further comprises the following steps:

8. A target commodity identification device, comprising:

the image-text acquisition module is used for acquiring a commodity title and a commodity picture in the commodity information of the target commodity;

the semantic extraction module is used for extracting deep semantic information of the commodity picture and the commodity title;

the characteristic fusion module is used for fusing the deep semantic information of the commodity title to the deep semantic information of the commodity picture so as to highlight the image characteristics of the target commodity in the deep semantic information of the commodity picture according to the deep semantic information of the commodity title and obtain image-text fusion characteristic information;

and the target identification module is used for inputting the image-text fusion characteristic information into a target detection model which is trained to be convergent in advance, and identifying the target commodity.

9. A computer device comprising a central processor and a memory, characterized in that the central processor is configured to invoke and run a computer program stored in the memory so as to perform the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 7, which, when invoked by a computer, performs the steps comprised by the corresponding method.
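
For illustration only, the data flow recited in claims 1 and 8 (title preprocessing, fusion of the title semantics into the picture semantics, then target detection) might be sketched as follows. This is a toy sketch under stated assumptions, not the claimed implementation: the function names `filter_title`, `fuse`, and `detect` are hypothetical, "invalid characters" is assumed to mean anything other than word characters and whitespace, the semantic encoders are not modelled (feature vectors are supplied directly), a similarity-weighted scaling stands in for the fusion module, and a largest-norm rule stands in for the pre-trained target detection model.

```python
import math
import re
from typing import List, Sequence

Vector = List[float]


def filter_title(title: str) -> str:
    """Remove characters ASSUMED to be invalid (not word chars or whitespace)."""
    cleaned = re.sub(r"[^\w\s]", " ", title)
    return " ".join(cleaned.split())  # collapse repeated whitespace


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(title_vec: Vector, region_vecs: Sequence[Vector]) -> List[Vector]:
    """Toy fusion (assumption): scale each image-region feature by its
    softmax-normalised dot-product similarity to the title feature, so that
    regions matching the title are highlighted."""
    scores = [sum(t * r for t, r in zip(title_vec, region)) for region in region_vecs]
    weights = softmax(scores)
    return [[w * x for x in region] for w, region in zip(weights, region_vecs)]


def detect(fused: Sequence[Vector]) -> int:
    """Stand-in for the trained detector (assumption): return the index of
    the fused region feature with the largest Euclidean norm."""
    norms = [math.sqrt(sum(x * x for x in region)) for region in fused]
    return norms.index(max(norms))


if __name__ == "__main__":
    print(filter_title("New!! Summer ## Dress, 2022"))
    title_vec = [1.0, 0.0]  # hypothetical title embedding
    regions = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]  # hypothetical region features
    print(detect(fuse(title_vec, regions)))
```

In this sketch the region whose feature is most similar to the title feature receives the largest weight and is therefore the one selected; an actual embodiment would replace each stand-in with the trained encoders, fusion module, and detection model described in the application.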