CN116720517A

CN116720517A - Search word component recognition model construction method and search word component recognition method

Info

Publication number: CN116720517A
Application number: CN202210188760.4A
Authority: CN
Inventors: 易磊; 黄泽谦; 张伟; 朱秀红; 黄锦鸿
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-02-28
Filing date: 2022-02-28
Publication date: 2023-09-08

Abstract

The application relates to a search word component recognition model construction method and a search word component recognition method. The search word component recognition model construction method comprises the following steps: component identification prediction is carried out on the acquired intra-industry search word samples, and component identification prediction results corresponding to each search word sample are obtained; screening a sample to be marked, of which the component identification prediction result accords with marking conditions, from the search word samples; labeling the sample to be labeled to obtain a labeled search term sample; and carrying out model training according to the marked search word sample to obtain a search word component recognition model which corresponds to the industry and is used for carrying out component recognition on the search word to be recognized. The method can greatly reduce the number of manually marked samples by utilizing active learning, so that the construction efficiency of the search word component recognition model is improved by simplifying data marking operation, and the search word component recognition model supporting efficient component recognition is obtained.

Description

Search word component recognition model construction method and search word component recognition method

Technical Field

The application relates to the technical field of artificial intelligence, in particular to a search word component recognition model construction method and a search word component recognition method.

Background

With the development of artificial intelligence technology, component recognition technology, which may also be called named entity recognition, is mainly used for recognizing entities having meaning in a text, for example, when applied to search words of an article, component recognition may be used for recognizing brand, product, color, size and other attributes in the search words.

In the conventional technology, a common component recognition method comprises dictionary matching and model prediction, wherein the dictionary matching refers to firstly mining a dictionary of each component, then obtaining a component result in a text to be recognized in a text matching mode, and the model prediction refers to training a model based on a large amount of manually marked data to perform component recognition.

However, when the conventional methods are applied to search words for articles, the important attributes differ from industry to industry (for example, "CPU" (central processing unit) and "camera pixel" are important attributes in the mobile phone industry, while "applicable skin" and "slogan" are important attributes in the cosmetics industry). As a result, both the component type definitions needed when mining the dictionary of each component for dictionary matching and the large amount of manually labeled data used in model prediction must be produced by workers applying professional knowledge. The operation is cumbersome, the amount of labeled data is large, and the component recognition efficiency is low.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a search term component recognition model construction method, apparatus, computer device, computer readable storage medium, and computer program product that can support efficient component recognition, and a search term component recognition method, apparatus, computer device, computer readable storage medium, and computer program product that improve component recognition efficiency.

In a first aspect, the present application provides a method for constructing a search term component recognition model. The method comprises the following steps:

component identification prediction is carried out on the acquired intra-industry search word samples, and component identification prediction results corresponding to each search word sample are obtained;

screening a sample to be marked, of which the component identification prediction result accords with marking conditions, from the search word samples;

labeling the sample to be labeled to obtain a labeled search term sample;

and carrying out model training according to the marked search word sample to obtain a search word component recognition model which corresponds to the industry and is used for carrying out component recognition on the search word to be recognized.

In a second aspect, the application further provides a search word component recognition model construction device. The device comprises:

The prediction module is used for carrying out component identification prediction on the acquired intra-industry search word samples to obtain component identification prediction results corresponding to each search word sample;

the screening module is used for screening out samples to be marked, of which the component identification prediction results accord with marking conditions, from the search word samples;

the labeling module is used for labeling the sample to be labeled to obtain a labeled search word sample;

and the model training module is used for carrying out model training according to the marked search word sample to obtain a search word component recognition model which corresponds to the industry and is used for carrying out component recognition on the search word to be recognized.

In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:

labeling the sample to be labeled to obtain a labeled search term sample;

In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:

labeling the sample to be labeled to obtain a labeled search term sample;

In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:

labeling the sample to be labeled to obtain a labeled search term sample;

In a sixth aspect, the present application provides a method for identifying search term components. The method comprises the following steps:

receiving a search word processing request, wherein the search word processing request carries a search word to be identified;

industry recognition is carried out on search words to be recognized, and target industries are determined;

and carrying out component recognition on the search word to be recognized through a search word component recognition model of the target industry to obtain a search word component recognition result, wherein the search word component recognition model is constructed through the search word component recognition model construction method.

In a seventh aspect, the present application further provides a search term component recognition apparatus. The device comprises:

the receiving module is used for receiving a search word processing request, wherein the search word processing request carries a search word to be identified;

the industry identification module is used for carrying out industry identification on the search word to be identified and determining a target industry;

The component recognition module is used for carrying out component recognition on the search word to be recognized through a search word component recognition model of the target industry to obtain a search word component recognition result, and the search word component recognition model is constructed through the search word component recognition model construction method.

In an eighth aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:

In a ninth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:

In a tenth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:

According to the above method, apparatus, computer device, storage medium and computer program product for constructing the search word component recognition model, component recognition prediction is performed on the acquired intra-industry search word samples to obtain a component recognition prediction result for each search word sample, and the samples to be labeled, whose component recognition prediction results meet the labeling condition, are screened out from the search word samples. Because the samples to be labeled are selected through active learning, only these samples need to be labeled to obtain the labeled search word samples, which greatly reduces the number of manually labeled samples. Model training is then performed on the labeled search word samples to obtain a search word component recognition model that corresponds to the industry and is used for component recognition of the search word to be recognized. Throughout the process, active learning simplifies the data labeling operation, which improves the construction efficiency of the search word component recognition model and yields a model that supports efficient component recognition.

According to the search word component recognition method, the device, the computer equipment, the storage medium and the computer program product, the search word processing request carrying the search word to be recognized is received, the industry recognition is carried out on the search word to be recognized, the target industry is determined, the search word to be recognized is subjected to component recognition through the search word component recognition model of the target industry, the search word component recognition result is obtained, component recognition can be realized by utilizing the search word component recognition model corresponding to the target industry which is constructed efficiently, and the component recognition efficiency can be improved.

Drawings

FIG. 1 is a diagram of an application environment for a search term component recognition model building method in one embodiment;

FIG. 2 is a flow diagram of a method for constructing a search term component recognition model in one embodiment;

FIG. 3 is a schematic diagram of sample item information within the industry in one embodiment;

FIG. 4 is a schematic diagram of a model structure of an initial component recognition model in one embodiment;

FIG. 5 is a flow diagram of attribute mining of sample item information within an industry, in one embodiment;

FIG. 6 is a flow diagram of labeling sample item information within an industry in one embodiment;

FIG. 7 is a flow diagram of a method of identifying search term components in one embodiment;

FIG. 8 is a schematic diagram of a search term component recognition model construction method in one embodiment;

FIG. 9 is a flow diagram of a final model obtained by constructing a search term field component recognition model using a title field component recognition model in one embodiment;

FIG. 10 is a flow chart of a method for constructing a search term component recognition model in another embodiment;

FIG. 11 is a block diagram of a search term composition recognition model construction apparatus in one embodiment;

FIG. 12 is a block diagram of an apparatus for identifying components of search terms in one embodiment;

FIG. 13 is an internal structural view of a computer device in one embodiment.

Detailed Description

The application relates to the technical field of artificial intelligence, wherein artificial intelligence (Artificial Intelligence, AI) is a theory, method, technology and application system which utilizes a digital computer or a machine controlled by the digital computer to simulate, extend and expand human intelligence, sense environment, acquire knowledge and acquire an optimal result by using the knowledge. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning. The application mainly relates to natural language processing technology. Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering by robots, knowledge graph techniques, and the like.

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. It should be noted that the embodiments of the present application may be applied to various scenarios, including, but not limited to, cloud technology, artificial intelligence, intelligent transportation, driving assistance, and the like.

The search word component recognition model construction method and the search word component recognition method provided by the embodiments of the application can be applied to the application environment shown in FIG. 1. When a search is needed, a user can send a search word processing request carrying a search word to be recognized (for example, "XX mobile phone golden four-shot full-network general 8+128", where XX is a brand word) to the terminal through an interface displayed on the terminal. After receiving the search word processing request, the terminal performs industry recognition on the search word to be recognized and determines that the corresponding target industry is the mobile phone industry. The terminal then obtains the constructed search word component recognition model of the mobile phone industry and uses it to perform component recognition on the search word to be recognized, obtaining a search word component recognition result (XX{brand} mobile phone{product positioning} golden{color} four-shot{camera number} full-network general{operator} 8{running memory}+128{body memory}). The terminal may be, but is not limited to, a desktop computer, notebook computer, smart phone, tablet computer, internet of things device, portable wearable device, aircraft, or the like. The internet of things device may be a smart speaker, smart television, smart air conditioner, smart vehicle-mounted device, or the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like.
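The request flow described above (industry recognition, selection of the industry-specific model, then component recognition) can be sketched as follows. This is only a minimal illustration: `toy_identify_industry`, `toy_phone_model`, the token lexicon and the whitespace tokenization are hypothetical stand-ins, not the industry classifier or recognition model of the application.

```python
from typing import Callable, Dict, List, Tuple

Segments = List[Tuple[str, str]]  # (text segment, attribute or "O")

def recognize_components(
    query: str,
    identify_industry: Callable[[str], str],
    models: Dict[str, Callable[[str], Segments]],
) -> Tuple[str, Segments]:
    """Determine the target industry, then run that industry's model."""
    industry = identify_industry(query)
    model = models[industry]  # one recognition model per industry
    return industry, model(query)

def render(segments: Segments) -> str:
    """Render segments in the 'text{attribute}' style used in the example."""
    return " ".join(f"{t}{{{a}}}" if a != "O" else t for t, a in segments)

# Hypothetical stand-ins, for illustration only.
def toy_identify_industry(query: str) -> str:
    return "mobile phone" if "phone" in query else "unknown"

def toy_phone_model(query: str) -> Segments:
    lexicon = {"XX": "brand", "phone": "product positioning", "golden": "color"}
    return [(tok, lexicon.get(tok, "O")) for tok in query.split()]

industry, segments = recognize_components(
    "XX phone golden", toy_identify_industry, {"mobile phone": toy_phone_model}
)
rendered = render(segments)
```

Keeping the models in a dictionary keyed by industry mirrors the idea that each industry has its own search word component recognition model.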

In one embodiment, as shown in FIG. 2, a method for constructing a search word component recognition model is provided. The method is described here as applied to a terminal by way of example. It is understood that the method may also be applied to a server, or to a system including the terminal and the server, in which case it is implemented through interaction between the terminal and the server. The server may be implemented as a stand-alone server or a server cluster formed by a plurality of servers, or may be a node on a blockchain. In this embodiment, the method includes the following steps:

Step 202, performing component identification prediction on the acquired intra-industry search word samples to obtain component identification prediction results corresponding to each search word sample.

The search term refers to text used for searching and input by a user in a search engine or the like. For example, a search term may specifically refer to text for searching that contains content of brand words, product words, colors, sizes, etc. Intra-industry search term samples refer to search term samples that are not labeled within the industry. For example, the intra-industry search term sample may specifically refer to unlabeled search terms obtained from an intra-industry historical search record. Component recognition prediction refers to predicting meaningful entities in a search word sample in industry through component recognition, wherein the meaningful entities can be brand words, product words, colors, sizes and the like.

The component recognition prediction result refers to the probability that each single word in the search word sample belongs to each labeling type, where the labeling types are determined according to the attribute information corresponding to the industry. For example, the labeling here may specifically be BIO (B-begin, I-inside, O-outside) labeling, which labels each element (in this embodiment, each single word) as "B-X", "I-X", or "O". Here, "B-X" indicates that the segment in which the element is located is of type X and that the element is at the beginning of the segment, "I-X" indicates that the segment in which the element is located is of type X and that the element is in the middle of the segment, and "O" indicates that the element does not belong to any type. It should be noted that the types in the BIO labeling are attributes. For example, X may specifically represent attributes such as brand words (brand) and product words (product), in which case the BIO labeling types may be: B-brand (beginning of a brand word), I-brand (middle of a brand word), B-product (beginning of a product word), I-product (middle of a product word), and O (not part of any attribute).
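As a small illustration of the BIO scheme described above, the following sketch converts attribute spans into one tag per single character. The function name `spans_to_bio`, the span format (start, exclusive end, attribute) and the sample string are assumptions made for this example only.

```python
from typing import List, Tuple

def spans_to_bio(text: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, attribute) spans, end exclusive, to BIO tags."""
    tags = ["O"] * len(text)           # default: not part of any attribute
    for start, end, attr in spans:
        tags[start] = f"B-{attr}"      # first element of the segment
        for i in range(start + 1, end):
            tags[i] = f"I-{attr}"      # remaining elements of the segment
    return tags

# Illustrative query: a two-character brand word followed by a product word.
example_text = "XX手机"
example_spans = [(0, 2, "brand"), (2, 4, "product")]
example_tags = spans_to_bio(example_text, example_spans)
# example_tags == ["B-brand", "I-brand", "B-product", "I-product"]
```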

Specifically, the terminal performs component identification prediction on the acquired intra-industry search word samples to obtain a component identification prediction result corresponding to each search word sample. Further, the terminal can perform this prediction through the article information component identification model corresponding to the industry. The article information component identification model is obtained by training on acquired intra-industry sample article information, which refers to article information used as samples in the industry. For example, when the industry is the home appliance industry, the intra-industry sample article information may specifically refer to home appliance article information. As a further example, as shown in FIG. 3, when the household appliance is a microwave oven, the household appliance article information includes article title information and article description information (i.e., product parameters). The article description information specifically includes attributes such as product model, external dimensions, door opening mode, net weight, rated input power, rated output power, microwave operating frequency, barbecue power, volume, certification, product noise value, and rated voltage/frequency, together with the attribute parameters corresponding to these attributes.

Step 204, screening out samples to be labeled, of which the component identification prediction results meet the labeling condition, from the search word samples.

The labeling condition refers to a condition needing manual labeling, and the labeling condition can be set according to the requirement. For example, the labeling condition may specifically be that an average value of information entropy of each individual word in the search word sample is greater than a preset information entropy threshold. For another example, the labeling condition may specifically be that the minimum confidence level of the search term sample is less than a preset confidence threshold. For another example, the labeling condition may specifically be that the minimum probability difference calculated according to the edge sampling principle is smaller than a preset probability difference threshold. The preset information entropy threshold value, the preset confidence coefficient threshold value and the preset probability difference value threshold value can be set according to the needs.

The information entropy refers to the average amount of information that remains after redundancy is eliminated from the information. In component recognition prediction, the most probable labeling type and its probability, that is, the maximum probability, are predicted for each single word. The minimum confidence of a search word sample is the smallest of these maximum probabilities, determined by sorting the maximum probabilities of the single words in the search word sample. Edge sampling (also known as margin sampling) refers to selecting the search word samples that could easily be assigned to either of two classes, that is, samples for which the probabilities of the two most likely classes differ little. The probability difference of a single word is the difference between the largest and the second-largest of the probabilities that the single word belongs to each labeling type, and the minimum probability difference is the smallest probability difference among the single words in the search word sample.

Specifically, the terminal screens out a sample to be marked, of which the component identification prediction result meets the marking condition, from the search word samples according to the marking condition. When the labeling condition is that the average value of the information entropy of each single word in the search word sample is larger than a preset information entropy threshold value, the terminal calculates the average value of the information entropy of each single word in the search word sample, and the sample to be labeled which accords with the labeling condition is screened out by comparing the average value of the information entropy of each single word in the search word sample with the preset information entropy threshold value.

When the labeling condition is that the minimum confidence of the search word sample is smaller than a preset confidence threshold, the terminal determines, according to the component recognition prediction result, the maximum probability with which each single word in the search word sample belongs to a labeling type, sorts these maximum probabilities, and takes the smallest of them as the minimum confidence of the search word sample. The terminal then screens out the samples to be labeled that meet the labeling condition by comparing the minimum confidence of each search word sample with the preset confidence threshold. For example, assume that the maximum probabilities of the single words in a search word sample are {B-brand: 0.7, I-brand: 0.6, B-product: 0.7, I-product: 0.4, O: 0.6, O: 0.2}; the minimum confidence of the search word sample is then 0.2.
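The minimum-confidence criterion can be sketched as below. The per-word probability distributions are made up so that their maxima match the example values (0.7, 0.6, 0.7, 0.4, 0.6, 0.2), and the threshold of 0.5 is an assumed value rather than one given in the application.

```python
from typing import Sequence

def min_confidence(token_probs: Sequence[Sequence[float]]) -> float:
    """Smallest, over all single words, of each word's maximum probability."""
    return min(max(probs) for probs in token_probs)

def needs_labeling_by_confidence(
    token_probs: Sequence[Sequence[float]], threshold: float
) -> bool:
    """True when the sample's minimum confidence is below the threshold."""
    return min_confidence(token_probs) < threshold

# Hypothetical distributions whose maxima are 0.7, 0.6, 0.7, 0.4, 0.6, 0.2.
example_probs = [
    [0.7, 0.3],
    [0.6, 0.4],
    [0.7, 0.3],
    [0.4, 0.3, 0.3],
    [0.6, 0.4],
    [0.2, 0.2, 0.2, 0.2, 0.2],
]
example_min_conf = min_confidence(example_probs)          # 0.2
example_selected = needs_labeling_by_confidence(example_probs, 0.5)
```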

When the labeling condition is that the minimum probability difference calculated according to the edge sampling principle is smaller than a preset probability difference threshold, the terminal calculates the probability difference of each single word according to the component recognition prediction result, sorts the probability differences of the single words to obtain the smallest one as the minimum probability difference, and screens out the samples to be labeled that meet the labeling condition by comparing the minimum probability difference with the preset probability difference threshold. For example, assuming that the probabilities of a single word "X" belonging to each labeling type are {0.4, 0.3, 0.1, 0.1, 0.1}, the probability difference of the word is the difference between the maximum probability (0.4) and the second-largest probability (0.3), i.e., 0.1.
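The probability-difference criterion can be sketched in the same way. The function names and the threshold of 0.15 are assumptions for illustration. The first distribution is the one from the example above, and the second is made up.

```python
from typing import Sequence

def token_margin(probs: Sequence[float]) -> float:
    """Difference between the largest and second-largest probability."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def min_margin(token_probs: Sequence[Sequence[float]]) -> float:
    """Smallest probability difference over all single words in a sample."""
    return min(token_margin(probs) for probs in token_probs)

def needs_labeling_by_margin(
    token_probs: Sequence[Sequence[float]], threshold: float
) -> bool:
    """True when the minimum probability difference is below the threshold."""
    return min_margin(token_probs) < threshold

word_x = [0.4, 0.3, 0.1, 0.1, 0.1]       # distribution from the example
margin_x = token_margin(word_x)           # approximately 0.1
sample = [word_x, [0.9, 0.05, 0.05]]      # second word is hypothetical
sample_selected = needs_labeling_by_margin(sample, 0.15)
```

A small probability difference means the model hesitates between two labeling types for at least one single word, which is why such samples are sent for manual labeling.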

Step 206, labeling the sample to be labeled to obtain a labeled search word sample.

Specifically, after determining the sample to be labeled, the terminal labels the sample to be labeled to obtain a labeled search word sample. The sample may be labeled by pushing the sample to be labeled, together with its component identification prediction result, to a user side to prompt the user to perform manual labeling. For example, the terminal may directly display the sample to be labeled and its component identification prediction result on the corresponding display screen for the user to label manually. It should be noted that, because the component identification prediction result is displayed alongside the sample to be labeled as a reference during manual labeling, the difficulty of manual labeling is noticeably reduced and human resources are saved.

Step 208, performing model training according to the labeled search word sample to obtain a search word component recognition model which corresponds to the industry and is used for component recognition of the search word to be recognized.

Specifically, the terminal performs model training according to the labeled search word samples to obtain a search word component recognition model that corresponds to the industry and is used for component recognition of the search word to be recognized. Further, to improve the component recognition accuracy of the search word component recognition model, the model training may be an iterative process. That is, after model training on the labeled search word samples, the terminal first obtains a component recognition model to be optimized. It then acquires new search word samples and trains the component recognition model to be optimized on them to obtain a new component recognition model to be optimized. This step is repeated until the latest component recognition model to be optimized meets a preset condition for stopping iterative training, at which point the search word component recognition model corresponding to the industry is obtained.

The preset condition for stopping iterative training can be set as needed and is not specifically limited in this embodiment. For example, the condition may be that the model accuracy of the latest component recognition model to be optimized reaches a preset accuracy target, which can likewise be set as needed, for example 95%. The model accuracy can be obtained by calculating an accuracy metric of the latest component recognition model to be optimized on a predefined test set, that is, a test set prepared in advance that includes search words and their corresponding component recognition results.
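The iterative procedure of steps 202 to 208 with an accuracy-based stopping condition might be organized as in the sketch below. All callables (`fetch_unlabeled`, `predict`, `meets_condition`, `annotate`, `train`, `evaluate`) and the `max_rounds` safeguard are hypothetical placeholders, and the toy stand-ins at the end exist only so that the loop can be run. None of them represents the actual model or training procedure.

```python
from typing import Any, Callable, List, Tuple

def build_model(
    model: Any,
    fetch_unlabeled: Callable[[], List[str]],
    predict: Callable[[Any, str], Any],
    meets_condition: Callable[[Any], bool],
    annotate: Callable[[str, Any], Any],
    train: Callable[[Any, List[Any]], Any],
    evaluate: Callable[[Any], float],
    target_accuracy: float = 0.95,
    max_rounds: int = 10,
) -> Tuple[Any, int]:
    """Repeat predict -> screen -> label -> train until the target is met."""
    rounds = 0
    for _ in range(max_rounds):
        rounds += 1
        samples = fetch_unlabeled()                        # new samples
        predictions = [(s, predict(model, s)) for s in samples]
        to_label = [(s, p) for s, p in predictions if meets_condition(p)]
        labeled = [annotate(s, p) for s, p in to_label]    # manual labeling
        model = train(model, labeled)
        if evaluate(model) >= target_accuracy:             # stop condition
            break
    return model, rounds

# Toy stand-ins: the "model" is just a number playing the role of accuracy.
final_model, rounds_used = build_model(
    model=0.5,
    fetch_unlabeled=lambda: ["q1", "q2", "q3"],
    predict=lambda m, s: m,                 # confidence equals model score
    meets_condition=lambda conf: conf < 0.95,
    annotate=lambda s, p: (s, "label"),
    train=lambda m, labeled: min(1.0, m + 0.1 * len(labeled)),
    evaluate=lambda m: m,
)
```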

Furthermore, if the terminal performs component recognition prediction on the acquired intra-industry search word sample through the corresponding article information component recognition model of the industry, the article information component recognition model can be directly trained according to the labeled search word sample when the model is trained according to the labeled search word sample.

According to the above method for constructing the search word component recognition model, component recognition prediction is performed on the acquired intra-industry search word samples to obtain a component recognition prediction result for each search word sample, and the samples to be labeled, whose component recognition prediction results meet the labeling condition, are screened out from the search word samples. Because the samples to be labeled are selected through active learning, only these samples need to be labeled, which greatly reduces the number of manually labeled samples. Model training is then performed on the labeled search word samples to obtain a search word component recognition model that corresponds to the industry and is used for component recognition of the search word to be recognized. Throughout the process, active learning simplifies the data labeling operation, which improves the construction efficiency of the search word component recognition model and yields a model that supports efficient component recognition.

In one embodiment, screening the sample to be annotated, of which the component identification prediction result meets the annotation condition, from the search word samples includes:

obtaining the probability that each single word in the search word sample belongs to each labeling type according to the component recognition prediction result;

according to the probability of belonging to each annotation type, respectively calculating the information entropy of each single word;

and calculating the average value of the information entropy of each single word in the search word samples, and screening out the search word samples with the average value larger than a preset information entropy threshold value as samples to be marked.

The annotation types are all possible annotations determined from the attribute information corresponding to the industry. For example, when BIO annotation is used, the annotation types take the form B-X, I-X and O, where X represents an attribute: "B-X" indicates that the segment containing the element is of type X and the element is at the beginning of the segment; "I-X" indicates that the segment containing the element is of type X and the element is in the middle of the segment (that is, not at its beginning); and "O" indicates that the element does not belong to any type.
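As an illustration of the B/I/O-X scheme described above, the following minimal Python sketch converts attribute spans into one BIO label per character. The function name `spans_to_bio`, the `(start, end, attribute)` span format and the sample attribute names are assumptions made for illustration only and are not taken from the application.

```python
def spans_to_bio(text, spans):
    """Convert (start, end, attribute) spans into one BIO label per character.

    `end` is exclusive. Characters not covered by any span are labeled "O".
    The span format is a hypothetical convention chosen for this sketch.
    """
    labels = ["O"] * len(text)
    for start, end, attr in spans:
        labels[start] = f"B-{attr}"          # first element of the segment
        for i in range(start + 1, end):
            labels[i] = f"I-{attr}"          # remaining elements of the segment
    return labels


# Hypothetical example: "23L" is a volume value, "900w" is a power value.
text = "23L 900w"
labels = spans_to_bio(text, [(0, 3, "volume"), (4, 8, "power")])
```

Here the space between the two values receives the label "O", while each character of "23L" and "900w" receives a B- or I- label of the corresponding attribute.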

Specifically, the component recognition prediction result includes the probability that each single word in the search word sample belongs to each annotation type, so the terminal can obtain these probabilities directly from the component recognition prediction result. After obtaining the probability that each single word belongs to each annotation type, the terminal calculates the information entropy of each single word from these probabilities and the information entropy formula, calculates the average value of the information entropy of the single words in the search word sample, compares the average value with a preset information entropy threshold, and screens out the search word samples whose average value is larger than the preset information entropy threshold as samples to be annotated. The information entropy formula is $H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$, where $p(x_i)$ is the probability of belonging to the i-th annotation type and n is the number of annotation types. The average value of the information entropy of the single words in a search word sample is the ratio of the sum of the information entropy of all single words in the search word sample to the number of single words in the search word sample. In this embodiment, the probability that each single word in the search word sample belongs to each annotation type is obtained from the component recognition prediction result, the information entropy of each single word is calculated from these probabilities, and the average value of the information entropy of the single words in the search word sample is calculated, so that the search word samples whose average value is larger than the preset information entropy threshold can be screened out as samples to be annotated.
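The entropy-based screening described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the natural logarithm is used (the application does not specify the base), and the function names, the dictionary layout of `predictions` and the threshold value 0.5 are hypothetical choices for illustration.

```python
import math


def token_entropy(probs):
    """Information entropy of one single word's distribution over annotation types."""
    # Terms with zero probability contribute nothing and are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(sample_probs):
    """Sum of the per-word entropies divided by the number of single words."""
    return sum(token_entropy(p) for p in sample_probs) / len(sample_probs)


def select_for_annotation(predictions, threshold):
    """Return the search word samples whose mean entropy exceeds the threshold.

    `predictions` maps each search word sample to a list that holds, for each
    single word, its probabilities over the annotation types (assumed layout).
    """
    return [s for s, probs in predictions.items() if mean_entropy(probs) > threshold]


# Hypothetical predictions over three annotation types.
predictions = {
    "confident query": [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],
    "uncertain query": [[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]],
}
to_annotate = select_for_annotation(predictions, threshold=0.5)
```

With these illustrative numbers, only the sample whose per-word distributions are close to uniform exceeds the threshold and is passed on for manual annotation.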

In one embodiment, performing component identification prediction on the obtained intra-industry search term samples to obtain respective component identification prediction results corresponding to each search term sample includes:

component identification prediction is carried out on the acquired intra-industry search word samples through an article information component identification model corresponding to the industry, so that component identification prediction results corresponding to each search word sample are obtained;

Wherein, the article information component identification model is constructed by the following modes:

carrying out attribute mining on the acquired sample article information in the industry to obtain attribute information corresponding to the industry and an attribute dictionary corresponding to the attribute information, wherein the attribute dictionary comprises attribute parameters;

labeling the sample article information in the industry according to the attribute information and the attribute parameters in the attribute dictionary to obtain an article information training sample;

and training the initial component identification model through the article information training sample to obtain an article information component identification model corresponding to the industry.

The article information component recognition model corresponding to the industry is a trained model that can be used for component recognition of article information. The intra-industry sample article information refers to article information taken as samples in the industry, and includes article title information and article description information. The article title information refers to the title used to describe an article; for example, the article title information may be in the form of { [brand word] [volume size] [product word] [power size] }. The article description information refers to information that describes the article in detail, namely the article parameters, including the attributes of the article and the attribute parameters corresponding to the attributes. For example, the article description information may specifically be product parameters. For a household microwave oven, the article description information may specifically include attributes such as product model, external dimensions, door opening mode, product net weight, rated input power, rated output power, microwave operating frequency, barbecue power, volume, certification, product noise value and rated voltage/frequency, together with the attribute parameters corresponding to these attributes.

An attribute refers to a property possessed by an article. The attribute information corresponding to the industry refers to properties shared by most articles in the industry. For example, in the home appliance industry, rated input power, rated output power and the like are attribute information corresponding to the industry. The attribute dictionary corresponding to the attribute information refers to the set of attribute parameters corresponding to the attribute information. An attribute parameter refers to an attribute value corresponding to the attribute information. For example, for rated input power and rated output power, the corresponding attribute parameters are specific power values.

The initial component recognition model refers to a component recognition model that has not yet been trained. For example, the initial component recognition model may specifically be a BERT (Bidirectional Encoder Representations from Transformers) + CRF (Conditional Random Field) model, whose structure is shown in fig. 4. The BERT+CRF model can be used for sequence labeling. The input layer takes each word in the text sequence (i.e., the article information training sample in this embodiment) as the input of BERT, and BERT then encodes the input text sequence, that is, performs feature extraction. After the bidirectional BERT layer, a CRF layer performs decoding: taking the features extracted by the BERT layer as input, it calculates the label of each element in the text sequence, that is, the probability that each word in the article information training sample belongs to each annotation type.

Specifically, the terminal can perform component recognition prediction on the acquired intra-industry search word samples through the article information component recognition model corresponding to the industry, so as to obtain the component recognition prediction result corresponding to each search word sample. The article information component recognition model can be constructed as follows. The terminal takes the attributes appearing in the intra-industry sample article information as candidate attributes, traverses the intra-industry sample article information according to the candidate attributes, and mines from the candidate attributes the attribute information corresponding to the industry and the attribute dictionary corresponding to the attribute information. The terminal then performs text matching on the article title information in the intra-industry sample article information using the attribute parameters in the attribute dictionary, to obtain the target attribute parameters matched with the attribute parameters (namely, the attribute parameters appearing in the article title information). It determines the corresponding target attribute information from the attribute information according to the target attribute parameters, and labels the article title information according to the target attribute parameters and the target attribute information to obtain the article information training samples. Finally, it performs supervised training on the initial component recognition model with the article information training samples to obtain the article information component recognition model corresponding to the industry.

Performing supervised training on the initial component recognition model with the article information training samples means training with the article title information in the article information training samples as input and the annotations of the article title information as supervision labels. Specifically, during supervised training, component recognition prediction is performed on the article title information in the article information training samples through the initial component recognition model to obtain the component recognition prediction result corresponding to each piece of article title information, a model loss function is calculated by comparing the component recognition prediction results with the labels of the article title information, and the parameters of the initial component recognition model are adjusted according to the model loss function until the preset training stopping condition is reached, thereby obtaining the article information component recognition model corresponding to the industry. The preset training stopping condition can be set as required. For example, the preset training stopping condition may specifically be that the model loss function is smaller than a preset loss function threshold, where the preset loss function threshold can also be set as required. For another example, the preset training stopping condition may specifically be convergence of the model loss function.
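The two example stopping conditions above can be sketched as a small Python check. This is only an illustrative sketch: the application does not define how convergence is measured, so treating convergence as "the last few loss values differ by less than a tolerance" is an assumption, as are the function name `should_stop` and the parameters `tol` and `patience`.

```python
def should_stop(loss_history, loss_threshold=None, tol=1e-4, patience=3):
    """Decide whether training should stop, given the losses recorded so far.

    Stops when the latest loss is below `loss_threshold` (if one is given), or
    when the last `patience` consecutive changes in loss are all smaller than
    `tol` (an assumed, simple definition of convergence).
    """
    if not loss_history:
        return False
    if loss_threshold is not None and loss_history[-1] < loss_threshold:
        return True
    if len(loss_history) > patience:
        recent = loss_history[-(patience + 1):]
        return all(abs(a - b) < tol for a, b in zip(recent, recent[1:]))
    return False


# Hypothetical loss curves.
stop_by_threshold = should_stop([1.0, 0.5, 0.05], loss_threshold=0.1)
stop_by_convergence = should_stop([0.30000, 0.30001, 0.30002, 0.30001])
keep_training = should_stop([1.0, 0.5], loss_threshold=0.1)
```

In a training loop, such a check would be called after each parameter update or epoch with the accumulated loss values.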

In this embodiment, attribute mining is performed on the obtained intra-industry sample article information, so that attribute information corresponding to the industry and an attribute dictionary corresponding to the attribute information can be obtained, the intra-industry sample article information is marked according to the attribute information and the attribute parameters in the attribute dictionary, so that an article information training sample is obtained, an initial component recognition model can be trained by using the article information training sample, construction of an article information component recognition model corresponding to the industry is realized, component recognition prediction can be performed on the obtained intra-industry search term sample by using the article information component recognition model corresponding to the industry, and a component recognition prediction result corresponding to each search term sample is obtained.

In one embodiment, performing attribute mining on the acquired sample article information in the industry to obtain attribute information corresponding to the industry and an attribute dictionary corresponding to the attribute information includes:

taking the attribute appearing in the sample article information in the industry as a candidate attribute, traversing the sample article information in the industry, and respectively obtaining the attribute characteristics and the attribute parameter set of each candidate attribute;

calculating the attribute importance of the candidate attribute according to the attribute characteristics;

Screening target attributes meeting the attribute importance screening conditions from the candidate attributes according to the attribute importance;

combining the target attributes to obtain attribute information;

and aggregating the attribute parameter sets according to the attribute information to obtain an attribute dictionary matched with the attribute information.

The attribute features are features that characterize the candidate attributes in the intra-industry sample article information, and include at least two of a distribution frequency feature, a title co-occurrence feature and a word weight feature. The distribution frequency feature refers to the frequency with which a candidate attribute occurs in the article description information of the intra-industry sample article information. The title co-occurrence feature refers to the co-occurrence frequency of the attribute parameters corresponding to a candidate attribute in the article title information of the intra-industry sample article information, namely the frequency with which those attribute parameters occur in the article title information. The word weight feature refers to the average word weight of the attribute parameters corresponding to a candidate attribute in the article title information of the intra-industry sample article information, and is used to represent the importance of the candidate attribute in the intra-industry sample article information. It should be noted that the distribution frequency feature and the title co-occurrence feature can be obtained by direct counting after traversing the intra-industry sample article information, whereas the word weight feature needs to be calculated with a pre-trained word weight model: the weight of each word in the intra-industry sample article information can be obtained by inputting the intra-industry sample article information into the pre-trained word weight model, and the word weight feature can be obtained by averaging the weights of the words. The pre-trained word weight model can be trained as required, and this embodiment does not specifically limit it, as long as the pre-trained word weight model can output the weights of words. The attribute importance is used to describe the importance of an attribute.

Specifically, the terminal takes all the attributes appearing in the intra-industry sample article information as candidate attributes, and traverses the intra-industry sample article information according to the candidate attributes to obtain the attribute features and the attribute parameter set of each candidate attribute, where the attribute features include at least two of the distribution frequency feature, the title co-occurrence feature and the word weight feature. After obtaining the attribute features, the terminal computes a weighted average of the attribute features to obtain the attribute importance of each candidate attribute, and screens out, according to the attribute importance, the target attributes that meet the attribute importance screening condition from the candidate attributes. The attribute importance screening condition can be set as required; for example, it may specifically be that the attribute importance is greater than a preset importance threshold, and the preset importance threshold can also be set as required.
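The weighted-average importance calculation and threshold screening described above can be sketched in Python as follows. The feature values, the weights, the threshold and the function names are all hypothetical; in particular, the sketch assumes that the features have already been scaled to a comparable range, which the application does not state.

```python
def attribute_importance(features, weights):
    """Weighted average of the attribute features of one candidate attribute."""
    total_weight = sum(weights[name] for name in features)
    return sum(features[name] * weights[name] for name in features) / total_weight


def screen_target_attributes(candidates, weights, importance_threshold):
    """Keep the candidate attributes whose importance exceeds the threshold."""
    scores = {attr: attribute_importance(f, weights) for attr, f in candidates.items()}
    return {attr: s for attr, s in scores.items() if s > importance_threshold}


# Hypothetical weights and feature values (assumed to lie in [0, 1]).
weights = {"distribution_frequency": 0.4, "title_cooccurrence": 0.4, "word_weight": 0.2}
candidates = {
    "rated power": {"distribution_frequency": 0.9, "title_cooccurrence": 0.8, "word_weight": 0.7},
    "certification": {"distribution_frequency": 0.2, "title_cooccurrence": 0.1, "word_weight": 0.3},
}
targets = screen_target_attributes(candidates, weights, importance_threshold=0.5)
```

With these illustrative numbers, "rated power" scores about 0.82 and is kept as a target attribute, while "certification" scores about 0.18 and is filtered out.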

Specifically, after the target attributes are obtained, semantically close attributes, such as "flavor" and "fragrance", may exist among the target attributes. To ensure that the components (i.e., the attributes) identified by component recognition are distinguishable from one another, similar attributes need to be merged: the terminal calculates the attribute similarity between the target attributes, and merges the similar attributes among the target attributes according to the attribute similarity to obtain the attribute information. After obtaining the attribute information, the terminal aggregates the attribute parameter sets according to the attribute information, that is, aggregates the attribute parameter sets of the merged target attributes together, to obtain the attribute dictionary corresponding to the attribute information. Further, in the process of obtaining the attribute dictionary, after the attribute parameter sets are aggregated, the terminal cleans the attribute parameters in the aggregated attribute parameter sets, and obtains the attribute dictionary corresponding to the attribute information from the cleaned attribute parameters. The cleaning includes, but is not limited to, text normalization, stop word filtering, meaningless word filtering and the like. For example, the text normalization may be case conversion, special symbol processing and the like.
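The cleaning step described above can be illustrated with a short Python sketch. The specific rules here (lower-casing, replacing symbols other than a small allowed set with spaces, and dropping a caller-supplied stop word list) are assumptions chosen for illustration; the application only names the categories of cleaning and does not prescribe concrete rules.

```python
import re


def clean_parameters(parameters, stop_words=frozenset()):
    """Normalize attribute parameters and drop empty or stop-word entries.

    Normalization here means: trim, lower-case, replace special symbols with a
    space, and collapse repeated whitespace (illustrative rules only).
    """
    cleaned = set()
    for value in parameters:
        value = value.strip().lower()                 # case conversion
        value = re.sub(r"[^\w.%/+-]+", " ", value)    # special symbol processing
        value = re.sub(r"\s+", " ", value).strip()
        if value and value not in stop_words:         # stop word filtering
            cleaned.add(value)
    return cleaned


# Hypothetical aggregated attribute parameters for a "power" attribute.
raw_values = [" 900W ", "900w", "N/A!!", "other", ""]
power_dictionary = clean_parameters(raw_values, stop_words={"other", "n/a"})
```

After cleaning, the two spellings of the same power value collapse into a single dictionary entry, and the placeholder values are removed.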

For example, a schematic flow chart of attribute mining on the intra-industry sample article information may be as shown in fig. 5. The terminal first performs attribute importance filtering on the intra-industry sample article information (namely, the article detail page data in fig. 5; it should be noted that the article detail page data may specifically refer to the detail page data shown when the article is displayed on the terminal interface). To do so, the terminal takes the attributes appearing in the intra-industry sample article information as candidate attributes, traverses the intra-industry sample article information to obtain the attribute features and the attribute parameter set of each candidate attribute, calculates the attribute importance of each candidate attribute according to the attribute features, and finally screens out, according to the attribute importance, the target attributes that meet the attribute importance screening condition from the candidate attributes, thereby completing the attribute importance filtering. After the attribute importance filtering is completed, the terminal merges the target attributes, that is, merges similar attributes, to obtain the attribute information (namely, the important attributes in fig. 5). After obtaining the attribute information, the terminal aggregates the attribute parameter sets according to the attribute information, takes the aggregated attribute parameter sets as the attribute dictionary corresponding to the attribute information, and cleans the attribute dictionary to obtain the final attribute dictionary corresponding to the attribute information.

In this embodiment, by using the attribute appearing in the sample article information in the industry as a candidate attribute, traversing the sample article information in the industry to obtain the attribute feature and the attribute parameter set of each candidate attribute, and calculating the attribute importance of the candidate attribute according to the attribute feature, the target attribute meeting the attribute importance screening condition can be screened out from the candidate attribute according to the attribute importance, the attribute information can be obtained by combining the target attributes, and the attribute parameter set can be aggregated according to the attribute information to obtain the attribute dictionary matched with the attribute information.

In one embodiment, merging the target attributes to obtain attribute information includes:

calculating attribute similarity between the target attributes;

and merging the similar attributes according to the attribute similarity to obtain attribute information.

Wherein attribute similarity is used to describe the degree of similarity between attributes.

Specifically, when merging target attributes, the terminal calculates attribute similarity between any two target attributes, and merges similar attributes in the target attributes according to the attribute similarity to obtain attribute information. When calculating the attribute similarity, word2vec (related model used for generating word vectors) word vector similarity may be used for calculation (that is, word vectors of target attributes are calculated through a pre-trained word2vec model, and then attribute similarity is calculated through the word vectors of target attributes), or similarity scores between target attributes may be calculated through a pre-trained BERT model, which is not limited in this embodiment.

Further, when merging the target attributes, the terminal first sorts the target attributes by attribute importance to form a candidate sequence, and creates an empty sequence as the result sequence. Each time, the terminal selects the target attribute with the greatest importance from the candidate sequence and removes it from the candidate sequence for judgment. If the result sequence is empty, that target attribute is added directly to the result sequence. If the result sequence is not empty, the attribute similarity between that target attribute and each target attribute in the result sequence is checked: if the attribute similarity is smaller than a preset similarity threshold in every case, the target attribute is added to the result sequence; otherwise, the target attribute is discarded. After all the target attributes in the candidate sequence have been traversed, the target attributes in the result sequence are used as the attribute information.
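The greedy merging procedure above can be sketched in Python as follows. The similarity function is passed in as a parameter, because the application allows either word2vec vector similarity or a BERT-based score; the lookup table used here is a hypothetical stand-in for such a model, and the importance values and the threshold are illustrative.

```python
def merge_similar_attributes(importance, similarity, similarity_threshold):
    """Greedily keep attributes in descending importance, dropping near-duplicates.

    `importance` maps each target attribute to its attribute importance.
    `similarity` is any callable returning a similarity score for two attributes.
    """
    candidate_sequence = sorted(importance, key=importance.get, reverse=True)
    result_sequence = []
    for attr in candidate_sequence:
        # all() over an empty result sequence is True, so the first attribute is kept.
        if all(similarity(attr, kept) < similarity_threshold for kept in result_sequence):
            result_sequence.append(attr)
    return result_sequence


# Hypothetical similarity scores standing in for a word-vector or BERT model.
PAIR_SIMILARITY = {frozenset(("flavor", "fragrance")): 0.85}


def lookup_similarity(a, b):
    return PAIR_SIMILARITY.get(frozenset((a, b)), 0.1)


importance = {"flavor": 0.9, "fragrance": 0.7, "net weight": 0.8}
attribute_information = merge_similar_attributes(importance, lookup_similarity, 0.8)
```

In this example "fragrance" is discarded because it is too similar to the more important attribute "flavor", which is already in the result sequence.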

In this embodiment, by calculating the attribute similarity between the target attributes, the similar attributes in the target attributes can be combined according to the attribute similarity, so as to obtain attribute information.

In one embodiment, labeling the sample article information in the industry according to the attribute information and the attribute parameters in the attribute dictionary, and obtaining the article information training sample comprises:

performing word segmentation on the article title information in the intra-industry sample article information to obtain a title word segmentation result;

text matching is carried out on the title word segmentation result through attribute parameters in the attribute dictionary, so that a text matching result is obtained, and the text matching result comprises target attribute parameters matched with the attribute parameters;

determining corresponding target attribute information from the attribute information according to the target attribute parameters;

and labeling the article title information according to the target attribute parameters and the target attribute information to obtain an article information training sample.

Specifically, the terminal performs word segmentation on the article title information in the intra-industry sample article information to obtain a title word segmentation result, and performs text matching on the title word segmentation result using the attribute parameters in the attribute dictionary to obtain preliminary matching results. The terminal then checks whether the word boundaries in the preliminary matching results conflict with the title word segmentation result, removes unreasonable matching results, and retains the maximum matching result among the preliminary matching results as the text matching result. The maximum matching result may specifically be a forward maximum matching result.

Specifically, after the text matching result is obtained, the terminal determines the corresponding target attribute information from the attribute information according to the target attribute parameters, matched with the attribute parameters, in the text matching result, and annotates the corresponding target attribute information at the positions of the target attribute parameters in the article title information to obtain an article information training sample. Taking a microwave oven as an example, the article information training sample may be in the form of "brand word [brand] 23L [volume] variable frequency household microwave oven [product] 900w (watt) [power]".
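The dictionary-based labeling above can be sketched in Python as a forward maximum match over the segmented title. This is a simplified sketch under stated assumptions: matching is done on whole tokens only, which is one simple way to avoid conflicts with the word segmentation result; tokens are joined with a space because the example is in English (for Chinese text they would presumably be joined without a separator); and the function name, the `max_span` limit and the sample dictionary are hypothetical.

```python
def label_title(tokens, attribute_dictionary, max_span=4, joiner=" "):
    """Label a segmented title by forward maximum matching against the dictionary.

    Returns (text, attribute) pairs; attribute is None for unmatched tokens.
    """
    # Invert the dictionary: attribute parameter -> attribute name.
    value_to_attr = {
        value: attr for attr, values in attribute_dictionary.items() for value in values
    }
    labeled, i = [], 0
    while i < len(tokens):
        # Try the longest span of consecutive tokens first.
        for length in range(min(max_span, len(tokens) - i), 0, -1):
            candidate = joiner.join(tokens[i:i + length])
            if candidate in value_to_attr:
                labeled.append((candidate, value_to_attr[candidate]))
                i += length
                break
        else:
            labeled.append((tokens[i], None))
            i += 1
    return labeled


# Hypothetical segmented title and attribute dictionary.
tokens = ["BrandX", "23L", "household", "microwave", "oven", "900w"]
attribute_dictionary = {
    "brand": {"BrandX"},
    "volume": {"23L"},
    "product": {"microwave oven", "oven"},
    "power": {"900w"},
}
training_sample = label_title(tokens, attribute_dictionary)
```

Because longer spans are tried first, "microwave oven" is matched as one product value rather than leaving "microwave" unlabeled and matching only "oven".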

For example, a schematic flow chart for labeling the intra-industry sample article information in this embodiment may be as shown in fig. 6. The terminal first performs word segmentation on the article title information in the intra-industry sample article information to obtain a title word segmentation result, and performs text matching on the title word segmentation result using the attribute parameters in the attribute dictionary (exemplified in fig. 6 as an attribute A dictionary, an attribute B dictionary, an attribute C dictionary and an attribute D dictionary) to obtain preliminary matching results. The terminal then performs a word segmentation check on the preliminary matching results (that is, checks whether the word boundaries in the preliminary matching results conflict with the title word segmentation result and removes unreasonable matching results), and retains the maximum matching result among the preliminary matching results as the text matching result. Finally, the terminal determines the corresponding target attribute information from the attribute information according to the target attribute parameters, matched with the attribute parameters, in the text matching result, labels the article title information according to the target attribute parameters and the target attribute information to obtain an article information training sample, and outputs the article information training sample.

In this embodiment, article title information in sample article information in industry is segmented to obtain a title segmentation result, text matching is performed on the title segmentation result through attribute parameters in an attribute dictionary, a text matching result including target attribute parameters matched with the attribute parameters can be obtained, further corresponding target attribute information can be determined from the attribute information according to the target attribute parameters, and the article title information is labeled according to the target attribute parameters and the target attribute information, so as to obtain an article information training sample.

In one embodiment, performing model training according to the labeled search word sample to obtain a search word component recognition model corresponding to industry and used for component recognition of a search word to be recognized comprises:

carrying out model training on the component identification model of the article information by marking the search word sample to obtain a component identification model to be optimized;

performing model training on the component recognition model to be optimized through the newly acquired search word sample to obtain a new component recognition model to be optimized;

and returning to the step of carrying out model training on the component recognition model to be optimized through the newly acquired search word sample to obtain a new component recognition model to be optimized until the latest component recognition model to be optimized reaches the preset iteration stopping training condition, so as to obtain a search word component recognition model which corresponds to the industry and is used for carrying out component recognition on the search word to be recognized.

Specifically, when model training is performed according to the labeled search word sample, an iterative training mode can be adopted, the terminal can perform model training on the article information component recognition model through the labeled search word sample to obtain a component recognition model to be optimized, then perform model training on the component recognition model to be optimized through the newly acquired search word sample to obtain a new component recognition model to be optimized, and then return to the step of performing model training on the component recognition model to be optimized through the newly acquired search word sample to obtain a new component recognition model to be optimized until the latest component recognition model to be optimized reaches a preset stop iterative training condition to obtain the search word component recognition model corresponding to the industry.

When a component recognition model to be optimized is trained through a newly acquired search word sample, the terminal firstly carries out component recognition prediction on the newly acquired search word sample through the component recognition model to be optimized to obtain component recognition prediction results corresponding to each search word sample, the sample to be marked, of which the component recognition prediction results accord with marking conditions, is screened out from the search word sample, the sample to be marked is marked to obtain a marked search word sample, and model training is carried out on the component recognition model to be optimized according to the marked search word sample to obtain the new component recognition model to be optimized.
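The iterative training procedure above can be sketched as a generic Python loop. All of the callables (`predict`, `meets_condition`, `annotate`, `train`, `stop`) are hypothetical placeholders for the prediction, screening, manual annotation, training and stopping steps; the toy implementations below exist only so that the control flow can be run and are not a real model.

```python
def iterative_training(model, sample_batches, predict, meets_condition, annotate, train, stop):
    """Repeat predict -> screen -> annotate -> train on each newly acquired batch."""
    for batch in sample_batches:
        predictions = {sample: predict(model, sample) for sample in batch}
        to_annotate = [s for s in batch if meets_condition(predictions[s])]
        labeled = [(s, annotate(s)) for s in to_annotate]
        model = train(model, labeled)       # new component recognition model to be optimized
        if stop(model):                     # preset condition for stopping iterative training
            break
    return model


# Toy stand-ins: the "model" is just the set of samples it has already seen.
def toy_predict(model, sample):
    return 0.0 if sample in model else 1.0  # uncertainty score


def toy_train(model, labeled):
    return model | {sample for sample, _ in labeled}


final_model = iterative_training(
    model=frozenset(),
    sample_batches=[["a", "b"], ["b", "c"], ["d"]],
    predict=toy_predict,
    meets_condition=lambda uncertainty: uncertainty > 0.5,
    annotate=lambda sample: "label",
    train=toy_train,
    stop=lambda model: len(model) >= 3,
)
```

In this toy run, the second batch only sends the unseen sample "c" for annotation, and the loop stops before the third batch because the stopping condition is already met.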

In this embodiment, the component recognition accuracy of the model can be improved by obtaining the search word component recognition model corresponding to the industry through iterative training.

In one embodiment, as shown in fig. 7, a search term component recognition method is provided, where the method is applied to a terminal to illustrate the method, it is understood that the method may also be applied to a server, and may also be applied to a system including the terminal and the server, and implemented through interaction between the terminal and the server. The terminal can be, but not limited to, various desktop computers, notebook computers, smart phones, tablet computers, internet of things equipment, portable wearable equipment, aircrafts and the like, and the internet of things equipment can be an intelligent sound box, an intelligent television, an intelligent air conditioner, intelligent vehicle-mounted equipment and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server may be implemented by a stand-alone server or a server cluster formed by a plurality of servers, or may be a node on a blockchain. In this embodiment, the method includes the steps of:

Step 702, a search term processing request is received, where the search term processing request carries a search term to be identified.

Wherein, the search term processing request refers to a request for processing a search term. For example, a search term processing request may refer specifically to a processing request initiated by a user through a search engine. The search term to be recognized refers to a search term that needs component recognition.

Specifically, when a search is required, a user sends a search word processing request carrying a search word to be identified to a terminal, and the terminal receives the search word processing request. For example, a user may send a search term processing request carrying a search term to be identified to a terminal through a search engine interface displayed on the terminal.

Step 704, industry recognition is performed on the search word to be recognized, and a target industry is determined.

Specifically, after receiving the search word processing request, the terminal performs industry recognition on the search word to be recognized so as to determine a target industry corresponding to the search word to be recognized. Furthermore, the terminal can perform industry recognition on the search word to be recognized through a pre-training industry recognition model, the pre-training industry recognition model is obtained through training an industry recognition sample, and the industry recognition sample comprises a text carrying an industry label.

Step 706, component recognition is performed on the search word to be recognized through the search word component recognition model of the target industry, so as to obtain a search word component recognition result, and the search word component recognition model is constructed through the search word component recognition model construction method.

Specifically, after determining the target industry, the terminal acquires a search word component recognition model of the target industry, and performs component recognition on the search word to be recognized through the search word component recognition model of the target industry to obtain a search word component recognition result.

According to the search word component recognition method, a search word processing request carrying the search word to be recognized is received, industry recognition is performed on the search word to be recognized to determine the target industry, and component recognition is performed on the search word to be recognized through the search word component recognition model of the target industry to obtain a search word component recognition result. Because component recognition uses the efficiently constructed search word component recognition model corresponding to the target industry, the component recognition efficiency can be improved.
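Steps 702 to 706 amount to a routing step followed by a call to a per-industry model. The following is a minimal sketch of that control flow, not the implementation of the application. The names `recognize_components`, `industry_classifier`, and `component_models`, as well as the toy stand-in models and lexicon, are hypothetical and introduced only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a component model maps a search word to (text span, component label) pairs.
ComponentModel = Callable[[str], List[Tuple[str, str]]]


def recognize_components(
    search_word: str,
    industry_classifier: Callable[[str], str],
    component_models: Dict[str, ComponentModel],
) -> List[Tuple[str, str]]:
    """Step 704: determine the target industry. Step 706: run that industry's model."""
    target_industry = industry_classifier(search_word)
    model = component_models[target_industry]
    return model(search_word)


# Toy stand-ins (assumptions for illustration only, not trained models).
def toy_industry_classifier(search_word: str) -> str:
    return "beauty" if "cleanser" in search_word else "appliances"


def toy_beauty_model(search_word: str) -> List[Tuple[str, str]]:
    lexicon = {"oily skin": "applicable skin type", "facial cleanser": "product"}
    return [(span, label) for span, label in lexicon.items() if span in search_word]


result = recognize_components(
    "oily skin facial cleanser",
    toy_industry_classifier,
    {"beauty": toy_beauty_model},
)
```

Keeping one model per industry behind a dictionary lookup mirrors the description, where the model used in step 706 is chosen by the industry determined in step 704.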

Further, the search word component recognition result obtained by the search word component recognition method can be applied to article recall and relevance calculation in a search scene. Recalling articles that contain a specific attribute according to the attribute extracted from the search word can improve recall accuracy and reduce the ranking pressure on the articles; meanwhile, the extracted attribute can be used as a feature for calculating the relevance between the search word and the article. For example, in an e-commerce scene where the articles are commodities, the applicable skin type "oily skin" can be identified from the search word "oily skin facial cleanser", and recalling only commodities applicable to oily skin during the search improves recall accuracy.

In one embodiment, as shown in fig. 8, the principle of the search term component recognition model construction method of the present application is illustrated by a schematic diagram. The method mainly includes the following steps: the terminal first obtains industry article basic data (namely the sample article information in the industry), and performs attribute importance calculation and dictionary mining on the obtained industry article basic data to obtain an article information training sample. The terminal trains an initial component recognition model with the article information training sample to build a title domain component recognition model (namely the article information component recognition model), and then uses the title domain component recognition model to build a search word domain component recognition model, thereby obtaining a final model (namely the search word component recognition model corresponding to the industry and used for component recognition of the search word to be recognized).

Further, by using the title domain component recognition model, a search word domain component recognition model is constructed, and a flowchart of the final model is shown in fig. 9. The terminal performs component recognition prediction on the obtained unlabeled search word (i.e. the search word sample in the industry) through the title domain component recognition model to obtain component recognition prediction results corresponding to each search word sample, screens out samples to be labeled, of which the component recognition prediction results accord with labeling conditions, labels the samples to be labeled to obtain labeled search words (i.e. the labeled search word samples), and performs model training on the title domain component recognition model according to the labeled search word samples to obtain search word component recognition models corresponding to the industry and used for component recognition of the search words to be recognized.

Further, when model training is performed on the title domain component recognition model according to the labeled search word sample, the terminal first trains the title domain component recognition model with the labeled search word sample to obtain a component recognition model to be optimized. The terminal then trains the component recognition model to be optimized with newly acquired search word samples to obtain a new component recognition model to be optimized, and repeats this step until the latest component recognition model to be optimized reaches a preset stop condition for iterative training, thereby obtaining the search word component recognition model corresponding to the industry and used for component recognition of the search word to be recognized.

In one embodiment, as shown in fig. 10, the method for constructing a search term component recognition model according to the present application is illustrated by a schematic flow chart, and specifically includes the following steps:

Step 1002, using the attribute appearing in the sample article information in the industry as a candidate attribute, traversing the sample article information in the industry, and respectively obtaining the attribute feature and the attribute parameter set of each candidate attribute.

Specifically, the terminal uses all the attributes appearing in the sample article information in the industry as candidate attributes, and traverses the sample article information in the industry according to the candidate attributes to respectively obtain attribute characteristics and attribute parameter sets of each candidate attribute, wherein the attribute characteristics comprise at least two of distribution frequency characteristics, title co-occurrence characteristics and word weight characteristics.

Step 1004, calculating the attribute importance of the candidate attribute according to the attribute characteristics.

Specifically, after obtaining the attribute features, the terminal performs weighted average on the attribute features, and calculates the attribute importance of the candidate attribute.
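The weighted average in step 1004 can be sketched as follows. This is a minimal illustration under stated assumptions: the feature names, feature values, and weights are hypothetical, since the description does not give concrete weights or value ranges.

```python
from typing import Dict


def attribute_importance(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the attribute features of one candidate attribute."""
    total_weight = sum(weights[name] for name in features)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(features[name] * weights[name] for name in features) / total_weight


# Hypothetical feature values and weights, for illustration only.
weights = {"distribution_frequency": 2.0, "title_cooccurrence": 1.0, "word_weight": 1.0}
volume_features = {"distribution_frequency": 0.8, "title_cooccurrence": 0.6, "word_weight": 0.4}
importance = attribute_importance(volume_features, weights)  # (1.6 + 0.6 + 0.4) / 4.0
```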

Step 1006, screening out the target attribute meeting the attribute importance screening condition from the candidate attributes according to the attribute importance.

The attribute importance screening condition can be set as needed. For example, the attribute importance screening condition may be that the attribute importance is greater than a preset importance threshold, where the preset importance threshold can also be set as needed.
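With a threshold as the screening condition, step 1006 reduces to a simple filter. The sketch below assumes the importance scores are already available; the attribute names, scores, and threshold value are hypothetical.

```python
from typing import Dict, List


def screen_target_attributes(importance: Dict[str, float], threshold: float) -> List[str]:
    """Keep candidate attributes whose importance exceeds the preset threshold."""
    return [name for name, score in importance.items() if score > threshold]


# Hypothetical importance scores and threshold.
scores = {"brand": 0.9, "volume": 0.65, "packaging color": 0.2}
targets = screen_target_attributes(scores, threshold=0.5)
```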

Step 1008, calculating attribute similarity between the target attributes.

Specifically, when calculating the attribute similarity, the terminal may calculate the word vector similarity by using word2vec (a related model used for generating word vectors) (i.e., calculate the word vector of the target attribute by using the pre-trained word2vec model, and then calculate the attribute similarity by using the word vector of the target attribute), or calculate the similarity score between the target attributes by using the pre-trained BERT model.
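For the word-vector route, one common choice of similarity measure is cosine similarity. The description does not name the measure, so cosine similarity is an assumption here, and the vectors below are made-up placeholders rather than output of a trained word2vec model.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two word vectors; 0.0 if either vector is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Placeholder vectors standing in for the word vectors of two target attributes.
vector_volume = [0.9, 0.1, 0.3]
vector_capacity = [0.8, 0.2, 0.35]
similarity = cosine_similarity(vector_volume, vector_capacity)
```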

Step 1010, merging similar attributes in the target attributes according to the attribute similarity to obtain attribute information.

Specifically, the terminal first sorts the target attributes by attribute importance to form a candidate sequence, and creates an empty sequence as the result sequence. Each time, the terminal selects the target attribute with the largest importance from the candidate sequence, deletes it from the candidate sequence, and checks the result sequence. If the result sequence is empty, the selected target attribute is added to the result sequence directly. If the result sequence is not empty, the terminal checks the attribute similarity between the selected target attribute and each target attribute in the result sequence; if every attribute similarity is smaller than a preset similarity threshold, the selected target attribute is added to the result sequence, and otherwise it is discarded. After all the target attributes in the candidate sequence have been traversed, the target attributes in the result sequence are taken as the attribute information.
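The merging procedure of step 1010 is a greedy pass over the attributes in descending order of importance. A minimal sketch follows; the function name `merge_similar_attributes`, the similarity table, and the threshold are hypothetical, and the sketch reads the condition as requiring the similarity to every already kept attribute to be below the threshold.

```python
from typing import Callable, Dict, List


def merge_similar_attributes(
    importance: Dict[str, float],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Greedy merge: keep an attribute only if it is dissimilar to all kept attributes."""
    # Candidate sequence, sorted so the most important attribute is taken first.
    candidates = sorted(importance, key=lambda name: importance[name], reverse=True)
    result: List[str] = []
    for attribute in candidates:
        # all() over an empty result sequence is True, so the first attribute is always kept.
        if all(similarity(attribute, kept) < threshold for kept in result):
            result.append(attribute)
    return result


# Hypothetical importance scores and pairwise similarities.
importance = {"volume": 0.9, "capacity": 0.8, "power": 0.7}
pair_similarity = {frozenset(("volume", "capacity")): 0.95}


def lookup_similarity(a: str, b: str) -> float:
    return pair_similarity.get(frozenset((a, b)), 0.1)


merged = merge_similar_attributes(importance, lookup_similarity, threshold=0.8)
```

In this toy run, "capacity" is discarded because it is too similar to the more important "volume", while "power" is kept.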

Step 1012, aggregating the attribute parameter sets according to the attribute information to obtain an attribute dictionary matched with the attribute information.

Specifically, after obtaining the attribute information, the terminal aggregates the attribute parameter set according to the attribute information, and cleans the aggregated attribute parameter set to obtain an attribute dictionary matched with the attribute information.
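Step 1012 can be pictured as grouping the parameter sets of merged attributes under the attribute that was kept. The description does not specify the aggregation or cleaning rules, so the sketch below makes assumptions: a hypothetical mapping `merged_into` records which kept attribute each candidate attribute was merged into, and cleaning is limited to trimming whitespace and dropping empty values.

```python
from typing import Dict, Set


def build_attribute_dictionary(
    parameter_sets: Dict[str, Set[str]],
    merged_into: Dict[str, str],
) -> Dict[str, Set[str]]:
    """Union the parameter sets of merged attributes under the kept attribute."""
    dictionary: Dict[str, Set[str]] = {}
    for attribute, parameters in parameter_sets.items():
        kept = merged_into.get(attribute)
        if kept is None:
            continue  # the attribute was screened out and contributes nothing
        cleaned = {p.strip() for p in parameters if p.strip()}
        dictionary.setdefault(kept, set()).update(cleaned)
    return dictionary


# Hypothetical parameter sets collected in step 1002.
parameter_sets = {
    "volume": {"23 liter", " 20 liter "},
    "capacity": {"25 liter", ""},
    "packaging color": {"red"},
}
merged_into = {"volume": "volume", "capacity": "volume"}
attribute_dictionary = build_attribute_dictionary(parameter_sets, merged_into)
```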

Step 1014, word segmentation is performed on the article title information in the sample article information in the industry, and a title word segmentation result is obtained.

Specifically, the terminal can segment the article title information in the sample article information in the industry by a word segmentation method such as jieba word segmentation to obtain a title word segmentation result.

Step 1016, performing text matching on the title word segmentation result through the attribute parameters in the attribute dictionary to obtain a text matching result, wherein the text matching result comprises target attribute parameters matched with the attribute parameters.

Specifically, the terminal performs text matching on the title word segmentation result through attribute parameters in the attribute dictionary to obtain a preliminary matching result, checks whether the word segmentation in the preliminary matching result conflicts with the title word segmentation result, removes unreasonable matching results, and reserves the maximum matching result in the preliminary matching result as a text matching result.
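One way to realize the matching in step 1016 is forward maximum matching over the segmented tokens. Because candidate spans are built only from whole tokens, a match cannot cut through a token, which is one simple reading of "not conflicting with the title word segmentation result". This is a sketch under that assumption; the function name, the space joiner, and the example dictionary are hypothetical.

```python
from typing import List, Set, Tuple


def match_title(tokens: List[str], parameters: Set[str], joiner: str = " ") -> List[Tuple[int, int, str]]:
    """Return (start, end, parameter) spans over token indices, preferring the longest match."""
    matches: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        found = None
        # Try the longest span starting at token i first.
        for j in range(len(tokens), i, -1):
            candidate = joiner.join(tokens[i:j])
            if candidate in parameters:
                found = (i, j, candidate)
                break
        if found is not None:
            matches.append(found)
            i = found[1]  # continue after the matched span
        else:
            i += 1
    return matches


tokens = ["23", "liter", "variable frequency", "household", "microwave oven", "900w"]
parameters = {"23 liter", "microwave oven", "household microwave oven", "900w"}
spans = match_title(tokens, parameters)
```

For Chinese titles an empty `joiner` would be the natural choice, since tokens are concatenated without spaces.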

Step 1018, determining corresponding target attribute information from the attribute information according to the target attribute parameters.

Specifically, the terminal can determine the target attribute information corresponding to the target attribute parameter from the attribute information according to the target attribute parameter.

Step 1020, labeling the item title information according to the target attribute parameters and the target attribute information to obtain an item information training sample.

Specifically, the terminal marks corresponding target attribute information at the position of the target attribute parameter in the item title information to obtain an item information training sample. Taking a microwave oven as an example to illustrate an article information training sample, the article information training sample can be in the form of 'brand word [ brand ]23 liter [ volume ] variable frequency household microwave oven [ product ]900w [ power ]'.
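The bracketed sample form shown above can be produced from token spans and their attribute labels. The following is an illustrative sketch only: the function name `annotate_title` and the span representation are hypothetical, and spans are assumed to be non-overlapping with `end > start`.

```python
from typing import List, Tuple


def annotate_title(tokens: List[str], spans: List[Tuple[int, int, str]]) -> str:
    """Render a title with '[attribute]' placed after each matched attribute parameter.

    spans are (start, end, attribute) over token indices and must not overlap.
    """
    by_start = {start: (end, attribute) for start, end, attribute in spans}
    pieces: List[str] = []
    i = 0
    while i < len(tokens):
        if i in by_start:
            end, attribute = by_start[i]
            pieces.append(" ".join(tokens[i:end]) + f" [{attribute}]")
            i = end
        else:
            pieces.append(tokens[i])
            i += 1
    return " ".join(pieces)


tokens = ["23", "liter", "variable frequency", "household", "microwave oven", "900w"]
spans = [(0, 2, "volume"), (4, 5, "product"), (5, 6, "power")]
sample = annotate_title(tokens, spans)
```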

Step 1022, training the initial component identification model through the article information training sample to obtain an article information component identification model corresponding to the industry.

Specifically, the terminal performs supervised training on the initial component identification model through the article information training sample to obtain an article information component identification model corresponding to the industry.

Step 1024, performing component recognition prediction on the acquired intra-industry search word samples through the industry corresponding object information component recognition model to obtain respective corresponding component recognition prediction results of each search word sample.

Step 1026, obtaining the probability that each single word in the search word sample belongs to each annotation type according to the component recognition prediction result.

The component recognition prediction result comprises the probability that each single word in the search word sample belongs to each annotation type.

Step 1028, respectively calculating the information entropy of each single word according to the probability of each labeling type.

Specifically, the terminal calculates the information entropy of each single word according to the probability of belonging to each labeling type and the information entropy formula. The information entropy formula is as follows:

$$H(x) = -\sum_{i=1}^{n} p(x_i)\log p(x_i)$$

where $p(x_i)$ is the probability of belonging to the $i$-th labeling type, and $n$ is the number of labeling types.
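The per-word information entropy is the standard Shannon entropy of the predicted label distribution. A minimal sketch follows; the natural logarithm is an assumption, since the description does not state the base, and zero probabilities are skipped because their contribution to the sum is zero in the limit.

```python
import math
from typing import Sequence


def token_entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy (natural log) of one single word's label distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


uncertain = token_entropy([0.5, 0.5])  # maximal for two labeling types: ln 2
confident = token_entropy([1.0, 0.0])  # no uncertainty
```

A high value means the model spreads probability across several labeling types for that word, which is the signal used to decide whether a sample is worth labeling.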

Step 1030, calculating an average value of the information entropy of each individual word in the search word samples, and screening out the search word samples with average value larger than a preset information entropy threshold as the samples to be marked.

Specifically, the average value of the information entropy of each individual word in the search word sample is the ratio of the sum of the information entropy of all individual words in the search word sample to the number of individual words in the search word sample, after the average value is calculated, the terminal compares the average value with a preset information entropy threshold value, and the search word sample with the average value larger than the preset information entropy threshold value is screened out as a sample to be marked.
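The screening rule of step 1030 can be sketched as below: average the per-word entropies of a search word sample and keep the sample when the average exceeds the threshold. The probabilities, the threshold, and the function names are hypothetical, and the natural logarithm is assumed.

```python
import math
from typing import Dict, List, Sequence


def mean_token_entropy(token_probabilities: Sequence[Sequence[float]]) -> float:
    """Sum of per-word entropies divided by the number of single words in the sample."""
    entropies = [
        -sum(p * math.log(p) for p in probabilities if p > 0.0)
        for probabilities in token_probabilities
    ]
    return sum(entropies) / len(entropies)


def select_samples_to_label(
    predictions: Dict[str, Sequence[Sequence[float]]],
    threshold: float,
) -> List[str]:
    """Keep samples whose mean entropy exceeds the preset information entropy threshold."""
    return [
        sample
        for sample, token_probabilities in predictions.items()
        if token_probabilities and mean_token_entropy(token_probabilities) > threshold
    ]


# Hypothetical per-word label probabilities for two search word samples.
predictions = {
    "oily skin cleanser": [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],
    "apple charger": [[0.4, 0.35, 0.25], [0.5, 0.3, 0.2]],
}
to_label = select_samples_to_label(predictions, threshold=0.5)
```

Only the sample on which the model is unsure is selected, so manual labeling effort goes to the samples expected to be most informative.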

Step 1032, labeling the sample to be labeled to obtain a labeled search term sample.

Specifically, after determining the sample to be marked, the terminal marks the sample to be marked to obtain a marked search word sample. The method for labeling the sample to be labeled may be that the sample to be labeled and the component identification prediction result of the sample to be labeled are pushed to a user side, so as to prompt the user to perform manual labeling.

Step 1034, performing model training on the article information component recognition model with the labeled search word sample to obtain the component recognition model to be optimized.

Specifically, the terminal performs supervised training on the article information component recognition model with the labeled search word sample to obtain the component recognition model to be optimized.

Step 1036, performing model training on the component recognition model to be optimized through the newly acquired search word sample to obtain a new component recognition model to be optimized.

Specifically, the terminal performs component recognition prediction on the newly acquired search word samples through the component recognition model to be optimized to obtain component recognition prediction results corresponding to each search word sample, screens out the sample to be marked, which is consistent with marking conditions, from the search word samples, marks the sample to be marked to obtain marked search word samples, and performs model training on the component recognition model to be optimized according to the marked search word samples to obtain the new component recognition model to be optimized.

Step 1038, returning to step 1036 until the latest component recognition model to be optimized reaches the preset iteration stopping training condition, and obtaining a search word component recognition model corresponding to the industry and used for component recognition of the search word to be recognized.
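Steps 1024 to 1038 form an active learning loop: predict, select uncertain samples, label them, retrain, and repeat until a stop condition holds. The sketch below shows only this control flow. All callables (`select`, `annotate`, `train`, `should_stop`) are hypothetical placeholders, and the toy "model" is just a counter of labeled samples, not a real recognizer.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Model = TypeVar("Model")
Labeled = Tuple[str, List[str]]  # hypothetical: (search word, per-word labels)


def active_learning_loop(
    model: Model,
    sample_batches: Iterable[List[str]],
    select: Callable[[Model, List[str]], List[str]],
    annotate: Callable[[List[str]], List[Labeled]],
    train: Callable[[Model, List[Labeled]], Model],
    should_stop: Callable[[Model], bool],
) -> Model:
    """Repeat predict/select/label/train on each new batch until the stop condition holds."""
    for batch in sample_batches:
        to_label = select(model, batch)   # prediction plus entropy screening
        labeled = annotate(to_label)      # manual labeling
        model = train(model, labeled)     # new component recognition model to be optimized
        if should_stop(model):
            break
    return model


# Toy stand-ins: the "model" counts how many labeled samples it has been trained on.
final_model = active_learning_loop(
    model=0,
    sample_batches=[["a", "b"], ["c", "d"], ["e", "f"]],
    select=lambda model, batch: batch[:1],
    annotate=lambda samples: [(s, ["O"]) for s in samples],
    train=lambda model, labeled: model + len(labeled),
    should_stop=lambda model: model >= 2,
)
```

In the description, the first iteration starts from the article information component recognition model, and the stop condition is the preset iteration stopping training condition.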

It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of the steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides a search word component recognition model construction device and a search word component recognition device for realizing the search word component recognition model construction method and the search word component recognition method described above. The solutions provided by these devices are implemented in a manner similar to that described in the above methods, so for specific limitations in the device embodiments provided below, reference may be made to the above limitations on the search word component recognition model construction method and the search word component recognition method, which are not repeated here.

In one embodiment, as shown in FIG. 11, there is provided a search term component recognition model construction device, comprising: a prediction module 1102, a screening module 1104, a labeling module 1106, and a model training module 1108, wherein:

the prediction module 1102 is configured to perform component identification prediction on the obtained intra-industry search term samples, so as to obtain component identification prediction results corresponding to each search term sample;

the screening module 1104 is used for screening out a sample to be marked, of which the component identification prediction result meets the marking condition, from the search word samples;

the labeling module 1106 is configured to label the sample to be labeled, so as to obtain a labeled search term sample;

the model training module 1108 is configured to perform model training according to the labeled search word sample, so as to obtain a search word component recognition model corresponding to the industry and used for component recognition of the search word to be recognized.

According to the search word component recognition model construction device, component recognition prediction is performed on the acquired intra-industry search word samples to obtain the component recognition prediction result corresponding to each search word sample, and the samples to be labeled, whose component recognition prediction results meet the labeling condition, are screened out from the search word samples. The samples to be labeled are thus selected through active learning, and only these samples need to be labeled to obtain the labeled search word samples, which greatly reduces the number of manually labeled samples. Model training is then performed according to the labeled search word samples to obtain the search word component recognition model corresponding to the industry and used for component recognition of the search word to be recognized. By simplifying the data labeling operation in this way, the construction efficiency of the search word component recognition model is improved, and a search word component recognition model supporting efficient component recognition is obtained.

In one embodiment, the screening module is further configured to obtain, according to the component recognition prediction result, a probability that each single word in the search word sample belongs to each labeling type, calculate an information entropy of each single word according to the probability that each single word belongs to each labeling type, calculate an average value of the information entropy of each single word in the search word sample, and screen out a search word sample with the average value being greater than a preset information entropy threshold as a sample to be labeled.

In one embodiment, the prediction module is further configured to perform component recognition prediction on the obtained intra-industry search term samples through an industry-corresponding article information component recognition model, so as to obtain respective corresponding component recognition prediction results of each search term sample; the model training module is also used for carrying out attribute mining on the acquired sample article information in the industry to obtain attribute information corresponding to the industry and an attribute dictionary corresponding to the attribute information, wherein the attribute dictionary comprises attribute parameters, the sample article information in the industry is marked according to the attribute information and the attribute parameters in the attribute dictionary to obtain an article information training sample, and the initial component recognition model is trained through the article information training sample to obtain an article information component recognition model corresponding to the industry.

In one embodiment, the model training module is further configured to use an attribute appearing in the sample article information in the industry as a candidate attribute, traverse the sample article information in the industry to obtain an attribute feature and an attribute parameter set of each candidate attribute, calculate an attribute importance of the candidate attribute according to the attribute feature, screen a target attribute meeting an attribute importance screening condition from the candidate attribute according to the attribute importance, combine the target attribute to obtain attribute information, aggregate the attribute parameter set according to the attribute information, and obtain an attribute dictionary matched with the attribute information.

In one embodiment, the model training module is further configured to calculate attribute similarity between the target attributes, and combine the similar attributes according to the attribute similarity to obtain attribute information.

In one embodiment, the model training module is further configured to segment object title information in sample object information in the industry to obtain a title segmentation result, text match the title segmentation result through attribute parameters in an attribute dictionary to obtain a text matching result, the text matching result includes a target attribute parameter matched with the attribute parameter, determine corresponding target attribute information from the attribute information according to the target attribute parameter, and label the object title information according to the target attribute parameter and the target attribute information to obtain an object information training sample.

In one embodiment, the model training module is further configured to perform model training on the article information component recognition model with the labeled search word sample to obtain a component recognition model to be optimized, perform model training on the component recognition model to be optimized with newly acquired search word samples to obtain a new component recognition model to be optimized, and repeat this step until the latest component recognition model to be optimized reaches a preset stop condition for iterative training, thereby obtaining the search word component recognition model corresponding to the industry and used for component recognition of the search word to be recognized.

In one embodiment, as shown in fig. 12, there is provided a search term component recognition apparatus, including: a receiving module 1202, an industry identification module 1204, and a component identification module 1206, wherein:

a receiving module 1202, configured to receive a search term processing request, where the search term processing request carries a search term to be identified;

the industry identification module 1204 is used for carrying out industry identification on the search word to be identified and determining a target industry;

the component recognition module 1206 is configured to perform component recognition on the search word to be recognized through a search word component recognition model of the target industry, so as to obtain a search word component recognition result, where the search word component recognition model is constructed through the search word component recognition model construction method.

According to the search word component recognition device, a search word processing request carrying the search word to be recognized is received, industry recognition is performed on the search word to be recognized to determine the target industry, and component recognition is performed on the search word to be recognized through the search word component recognition model of the target industry to obtain a search word component recognition result. Because component recognition uses the efficiently constructed search word component recognition model corresponding to the target industry, the component recognition efficiency can be improved.

The modules in the above search word component recognition model construction device and search word component recognition device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 13. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements a search term component recognition model construction method and a search term component recognition method. The display unit of the computer device is used for forming a visual picture, and may be a display screen, a projection device or a virtual reality imaging device, where the display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, may be a key, a track ball or a touch pad arranged on a shell of the computer device, or may be an external keyboard, touch pad, mouse or the like.

It will be appreciated by those skilled in the art that the structure shown in FIG. 13 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, may combine some of the components, or may have a different arrangement of components.

In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:

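performing component identification prediction on the acquired intra-industry search word samples to obtain component identification prediction results corresponding to each search word sample;

screening out, from the search word samples, a sample to be labeled whose component identification prediction result meets the labeling condition;

labeling the sample to be labeled to obtain a labeled search term sample;

and performing model training according to the labeled search word sample to obtain a search word component recognition model corresponding to the industry and used for component recognition of the search word to be recognized.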

In one embodiment, the processor when executing the computer program further performs the steps of: obtaining the probability that each single word in the search word sample belongs to each labeling type according to the component recognition prediction result, respectively calculating the information entropy of each single word according to the probability that each single word belongs to each labeling type, calculating the average value of the information entropy of each single word in the search word sample, and screening out the search word sample with the average value larger than a preset information entropy threshold value as a sample to be labeled.

In one embodiment, the processor when executing the computer program further performs the steps of: component identification prediction is carried out on the acquired intra-industry search word samples through an article information component identification model corresponding to the industry, so that component identification prediction results corresponding to each search word sample are obtained; and carrying out attribute mining on the acquired sample article information in the industry to obtain attribute information corresponding to the industry and an attribute dictionary corresponding to the attribute information, wherein the attribute dictionary comprises attribute parameters, marking the sample article information in the industry according to the attribute information and the attribute parameters in the attribute dictionary to obtain an article information training sample, and training an initial component identification model through the article information training sample to obtain an article information component identification model corresponding to the industry.

In one embodiment, the processor when executing the computer program further performs the steps of: taking the attribute appearing in the sample article information in the industry as a candidate attribute, traversing the sample article information in the industry to respectively obtain attribute characteristics and attribute parameter sets of each candidate attribute, calculating attribute importance of the candidate attribute according to the attribute characteristics, screening out target attributes meeting the attribute importance screening condition from the candidate attribute according to the attribute importance, merging the target attributes to obtain attribute information, and aggregating the attribute parameter sets according to the attribute information to obtain an attribute dictionary matched with the attribute information.

In one embodiment, the processor when executing the computer program further performs the steps of: calculating attribute similarity among the target attributes, and merging the similar attributes according to the attribute similarity to obtain attribute information.

In one embodiment, the processor when executing the computer program further performs the steps of: the method comprises the steps of segmenting object title information in sample object information in industry to obtain a title word segmentation result, carrying out text matching on the title word segmentation result through attribute parameters in an attribute dictionary to obtain a text matching result, determining corresponding target attribute information from the attribute information according to the target attribute parameters, and labeling the object title information according to the target attribute parameters and the target attribute information to obtain an object information training sample.

In one embodiment, the processor when executing the computer program further performs the steps of: performing model training on the article information component recognition model with the labeled search word samples to obtain a component recognition model to be optimized; performing model training on the component recognition model to be optimized with newly acquired search word samples to obtain a new component recognition model to be optimized; and repeating the step of training the component recognition model to be optimized with newly acquired search word samples until the latest component recognition model to be optimized satisfies a preset condition for stopping iterative training, thereby obtaining the search word component recognition model that corresponds to the industry and is used for component recognition of the search word to be recognized.
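The control flow of this iterative training can be sketched as follows. The sketch shows only the loop structure and makes no claim about the model or the training procedure: `train`, `acquire_new_samples`, and `should_stop` are hypothetical callables supplied by the caller, and `max_rounds` is an added safety cap that the text does not mention.

```python
def build_search_term_model(base_model, labeled_samples, acquire_new_samples,
                            train, should_stop, max_rounds=10):
    """Fine-tune the article-information model, then iterate on new samples."""
    # First pass: train the article information model on labeled search words.
    model = train(base_model, labeled_samples)
    for _ in range(max_rounds):
        # Preset condition for stopping iterative training.
        if should_stop(model):
            break
        # Train again on newly acquired (and labeled) search word samples.
        model = train(model, acquire_new_samples(model))
    return model

# Toy stand-ins so the loop can be exercised end to end.
def toy_train(model, samples):
    return {"seen": model["seen"] + len(samples)}

final = build_search_term_model(
    base_model={"seen": 0},
    labeled_samples=["q1", "q2"],
    acquire_new_samples=lambda model: ["q_new"],
    train=toy_train,
    should_stop=lambda model: model["seen"] >= 5,
)
```

In practice the stopping condition might be a validation metric reaching a plateau, but that choice is an assumption and is left to the caller here.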

In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:

labeling the sample to be labeled to obtain a labeled search term sample;

In one embodiment, the computer program when executed by the processor further performs the steps of: obtaining, from the component recognition prediction result, the probability that each single word in the search word sample belongs to each labeling type; calculating the information entropy of each single word from its probabilities over the labeling types; calculating the mean of the information entropies of the single words in the search word sample; and screening out, as samples to be labeled, the search word samples whose mean is greater than a preset information entropy threshold.
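The entropy-based screening above can be sketched as follows. This is a minimal sketch assuming Shannon entropy with the natural logarithm and assuming that each prediction is available as one probability list per single word; the function names, the threshold value, and the sample probabilities are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one single word's labeling-type distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(sample_probs):
    """Mean entropy over all single words of one search word sample."""
    return sum(token_entropy(p) for p in sample_probs) / len(sample_probs)

def select_for_labeling(predictions, threshold):
    """Keep samples whose mean per-word entropy exceeds the threshold."""
    return [term for term, probs in predictions.items()
            if probs and mean_entropy(probs) > threshold]

# Hypothetical predictions: one probability list per single word, three labeling types.
predictions = {
    "red dress": [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],   # confident prediction
    "apple case": [[0.4, 0.35, 0.25], [0.5, 0.3, 0.2]],      # uncertain prediction
}
to_label = select_for_labeling(predictions, threshold=0.5)
```

A high mean entropy indicates that the model is unsure how to label the sample, so only such samples are sent for manual labeling, which is how the number of manually labeled samples is kept small.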

In one embodiment, the computer program when executed by the processor further performs the steps of: performing component recognition prediction on the acquired intra-industry search word samples through the article information component recognition model corresponding to the industry, to obtain the component recognition prediction result corresponding to each search word sample. The article information component recognition model is constructed by: performing attribute mining on the acquired intra-industry sample article information to obtain attribute information corresponding to the industry and an attribute dictionary corresponding to the attribute information, wherein the attribute dictionary comprises attribute parameters; labeling the intra-industry sample article information according to the attribute information and the attribute parameters in the attribute dictionary to obtain article information training samples; and training an initial component recognition model with the article information training samples to obtain the article information component recognition model corresponding to the industry.

In one embodiment, the computer program when executed by the processor further performs the steps of: taking the attributes that appear in the intra-industry sample article information as candidate attributes, and traversing the intra-industry sample article information to obtain the attribute characteristics and the attribute parameter set of each candidate attribute; calculating the attribute importance of each candidate attribute according to its attribute characteristics; screening out, from the candidate attributes, target attributes that satisfy the attribute importance screening condition; merging the target attributes to obtain the attribute information; and aggregating the attribute parameter sets according to the attribute information to obtain an attribute dictionary matched with the attribute information. In one embodiment, the computer program when executed by the processor further performs the steps of: calculating the attribute similarity between the target attributes, and merging similar target attributes according to the attribute similarity to obtain the attribute information.
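The merging of similar target attributes can be sketched as follows. The text does not say how attribute similarity is computed, so this sketch assumes the Jaccard overlap of the attribute parameter sets and a greedy merge into the first sufficiently similar group; both choices, the function names, and the threshold are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two parameter collections (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merge_similar(attr_values, threshold=0.5):
    """Greedily fold each attribute into the first merged group it resembles."""
    merged = {}
    for attr, vals in attr_values.items():
        for name in merged:
            if jaccard(merged[name], vals) >= threshold:
                merged[name] |= set(vals)   # aggregate the parameter sets
                break
        else:
            merged[attr] = set(vals)        # no similar group found: start a new one
    return {name: sorted(vals) for name, vals in merged.items()}

# Hypothetical target attributes with their parameter sets.
attr_values = {
    "color": ["red", "blue", "green"],
    "colour": ["red", "blue", "black"],
    "brand": ["Acme", "Bolt"],
}
attribute_dictionary = merge_similar(attr_values, threshold=0.5)
```

Here `color` and `colour` share two of four distinct parameters, so they are merged under the first name seen and their parameter sets are combined, while `brand` stays separate.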

In one embodiment, the computer program when executed by the processor further performs the steps of: performing word segmentation on the article title information in the intra-industry sample article information to obtain a title word segmentation result; performing text matching on the title word segmentation result with the attribute parameters in the attribute dictionary to obtain a text matching result, wherein the text matching result comprises the target attribute parameters matched in the title; determining the corresponding target attribute information from the attribute information according to the target attribute parameters; and labeling the article title information according to the target attribute parameters and the target attribute information to obtain an article information training sample.

In one embodiment, the computer program when executed by the processor further performs the steps of: performing model training on the article information component recognition model with the labeled search word samples to obtain a component recognition model to be optimized; performing model training on the component recognition model to be optimized with newly acquired search word samples to obtain a new component recognition model to be optimized; and repeating the step of training the component recognition model to be optimized with newly acquired search word samples until the latest component recognition model to be optimized satisfies a preset condition for stopping iterative training, thereby obtaining the search word component recognition model that corresponds to the industry and is used for component recognition of the search word to be recognized.

In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:

labeling the sample to be labeled to obtain a labeled search term sample;

It should be noted that the article information (including but not limited to intra-industry sample article information) and the data (including but not limited to data for analysis, stored data, and displayed data) involved in the present application are information and data authorized by the user or fully authorized by all parties, and that the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.

Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, or a data processing logic unit based on quantum computing.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of the technical features that involves no contradiction should be considered to fall within the scope of this description.

The foregoing examples illustrate only a few embodiments of the application. They are described in specific detail, but they should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the application shall be determined by the appended claims.

Claims

1. A method for constructing a search term component recognition model, the method comprising:

screening out a sample to be marked, of which the component identification prediction result meets marking conditions, from the search word samples;

labeling the sample to be labeled to obtain a labeled search term sample;

and training the model according to the marked search word sample to obtain a search word component recognition model which corresponds to the industry and is used for component recognition of the search word to be recognized.

2. The method according to claim 1, wherein the screening out, from the search word samples, of the sample to be marked whose component recognition prediction result meets the marking condition comprises:

obtaining the probability that each single word in the search word sample belongs to each labeling type according to the component identification prediction result;

respectively calculating the information entropy of each single word according to the probability of each labeling type;

3. The method of claim 1, wherein performing component recognition prediction on the obtained intra-industry search term samples to obtain respective component recognition prediction results for each search term sample comprises:

wherein the article information component identification model is constructed by:

performing attribute mining on the acquired sample article information in the industry to obtain attribute information corresponding to the industry and an attribute dictionary corresponding to the attribute information, wherein the attribute dictionary comprises attribute parameters;

4. The method of claim 3, wherein the performing attribute mining on the obtained intra-industry sample article information to obtain industry-corresponding attribute information and an attribute dictionary corresponding to the attribute information comprises:

traversing the sample article information in the industry by taking the attribute appearing in the sample article information in the industry as a candidate attribute to respectively obtain attribute characteristics and attribute parameter sets of each candidate attribute;

combining the target attributes to obtain attribute information;

and according to the attribute information, aggregating the attribute parameter set to obtain an attribute dictionary matched with the attribute information.

5. The method of claim 4, wherein merging the target attributes to obtain attribute information comprises:

calculating attribute similarity between the target attributes;

and merging similar attributes in the target attributes according to the attribute similarity to obtain attribute information.

6. The method of claim 3, wherein labeling the intra-industry sample item information according to the attribute information and the attribute parameters in the attribute dictionary to obtain an item information training sample comprises:

performing word segmentation on object title information in the sample object information in the industry to obtain a title word segmentation result;

performing text matching on the title word segmentation result through attribute parameters in the attribute dictionary to obtain a text matching result, wherein the text matching result comprises target attribute parameters matched with the attribute parameters;

7. The method of claim 3, wherein the performing model training according to the labeled search term sample to obtain a search term component recognition model corresponding to the industry and used for component recognition of the search term to be recognized comprises:

carrying out model training on the article information component identification model through the labeling search word sample to obtain a component identification model to be optimized;

performing model training on the component identification model to be optimized through the newly acquired search word sample to obtain a new component identification model to be optimized;

and returning to the step of performing model training on the component identification model to be optimized through the newly acquired search word sample to obtain a new component identification model to be optimized, until the latest component identification model to be optimized reaches a preset stopping iteration training condition, so as to obtain a search word component recognition model which corresponds to the industry and is used for component recognition of the search word to be recognized.

8. A method for identifying search term components, the method comprising:

industry recognition is carried out on the search word to be recognized, and a target industry is determined;

and carrying out component recognition on the search word to be recognized through a search word component recognition model of the target industry to obtain a search word component recognition result, wherein the search word component recognition model is constructed through the method of any one of claims 1-7.

9. A search term component recognition model construction apparatus, the apparatus comprising:

the screening module is used for screening the sample to be marked, of which the component identification prediction result meets the marking condition, from the search word samples;

10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.

11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.

12. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.