CN114238617A

CN114238617A - Industry hotspot recommendation method and system

Info

Publication number: CN114238617A
Application number: CN202111567846.XA
Authority: CN
Inventors: 许冠中
Original assignee: Shenzhen Power Supply Bureau Co Ltd
Current assignee: Shenzhen Power Supply Bureau Co Ltd
Priority date: 2021-12-21
Filing date: 2021-12-21
Publication date: 2022-03-25

Abstract

The invention provides an industry hotspot recommendation method and system. The method comprises: collecting target data from each target data source; preprocessing the target data according to a preset preprocessing rule; inputting the preprocessed target data into a pre-trained recommendation model for calculation to obtain hot keywords; and counting the occurrence frequency of the hot keywords in a preset time period, ranking the hot keywords according to the occurrence frequency, and outputting the ranking result as an industry hotspot recommendation result. The invention collects continuously released, massive, heterogeneous, multi-level and multi-audience content data, such as internal reference materials, news, updates, work reports and academic journal papers, as industry corpus data, and uses the pre-trained model to help enterprise managers comprehensively analyze, discover and mine the industry's key points, hotspots and development trends, from the grassroots level up to the regional level.

Description

Industry hotspot recommendation method and system

Technical Field

The invention relates to the technical field of natural language processing, in particular to an industry hotspot recommendation method and system.

Background

The industry currently has a large demand for writing news manuscripts, notices, reports and other documents. Document writers must spend a great deal of effort analyzing massive document collections layer by layer in order to screen for and compile writing material. This process is prone to omissions, time-consuming and laborious, so writing efficiency is low and the burden on grassroots staff is heavy.

Existing methods use algorithms to discover fresh hotspots of enterprises in the industry, analyze the trend dynamics of organizations at all levels of an enterprise, provide hotspot analysis, hotspot news compilation and new-event news compilation functions for relevant personnel or users, assist intelligence collection, and provide auxiliary services for targeted workload reduction in offices. However, improving the efficiency of industry hotspot discovery, and thereby providing material and recommending topics for content creators in the target industry, remains a major difficulty.

Disclosure of Invention

The invention aims to provide an industry hotspot recommendation method and system, and solves the technical problems that the existing method is low in industry hotspot discovery efficiency and poor in recommendation topic selection accuracy.

In one aspect, an industry hotspot recommendation method is provided, which includes:

collecting target data of each target data source; preprocessing the target data according to a preset preprocessing rule;

inputting the preprocessed target data serving as input quantity into a pre-trained recommendation model for calculation to obtain hot keywords;

and counting the occurrence frequency of the hot keywords in a preset time period, ranking the hot keywords according to the occurrence frequency, and outputting the ranking result as an industry hotspot recommendation result.
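The final counting and ranking step can be illustrated with a minimal sketch. It assumes the extraction step has already produced (date, keyword) pairs; the function name `rank_hot_keywords` and the sample records are hypothetical and are not part of the disclosure.

```python
from collections import Counter
from datetime import date


def rank_hot_keywords(records, start, end):
    """Count keyword occurrences inside [start, end] and rank them.

    records: iterable of (date, keyword) pairs (assumed output of the
    keyword extraction step).
    """
    counts = Counter(kw for day, kw in records if start <= day <= end)
    # most_common() returns (keyword, count) pairs sorted by descending count.
    return counts.most_common()


# Hypothetical sample data.
records = [
    (date(2021, 12, 1), "smart grid"),
    (date(2021, 12, 2), "energy storage"),
    (date(2021, 12, 3), "smart grid"),
    (date(2021, 11, 1), "energy storage"),  # outside the window below
]
ranking = rank_hot_keywords(records, date(2021, 12, 1), date(2021, 12, 31))
```

This sketch ranks by absolute counts only; the preferred embodiments replace the absolute count with a relative word frequency.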

Preferably, the preprocessing the target data according to a preset preprocessing rule specifically includes:

converting the collected target data into a document in an HTML format and filing according to preset classification categories to obtain a text to be cleaned;

and converting the text to be cleaned into a document tree.

Preferably, the converting the text to be cleaned into the document tree specifically includes:

parsing the title tag of the text to be cleaned as the document title, and if no title tag exists in the input text to be cleaned, taking the file name of the text to be cleaned as the document title;

segmenting the whole document according to the document title to obtain a plurality of text nodes;

and analyzing the parent-child relationship of each text node on the document tree structure according to the text nodes, and associating all the text nodes according to the parent-child relationship of each text node on the document tree structure to obtain the document tree.
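The title parsing and segmentation steps above might be sketched as follows, using Python's standard `html.parser` module. This is an illustration under stated assumptions rather than the disclosed implementation: the class and function names are hypothetical, the use of h, p and div tags as segmentation points follows the detailed description, an empty title tag is treated like a missing one, and text lying directly inside a div after a nested block has closed is ignored for brevity.

```python
import re
from html.parser import HTMLParser
from pathlib import Path

# Tags used as segmentation points (h1-h6, p, div).
BLOCK_TAG = re.compile(r"^(h[1-6]|p|div)$")


class HtmlSegmenter(HTMLParser):
    """Collects the <title> text and paragraph-level text nodes."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.nodes = []
        self._in_title = False
        self._current = None  # [tag, style, text parts] of the open block

    def _flush(self):
        # Close the block being read and keep it if it contains text.
        if self._current is not None:
            tag, style, parts = self._current
            text = " ".join("".join(parts).split())
            if text:
                self.nodes.append({"tag": tag, "style": style, "text": text})
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif BLOCK_TAG.match(tag):
            self._flush()
            self._current = [tag, dict(attrs).get("style") or "", []]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif BLOCK_TAG.match(tag):
            self._flush()

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._current is not None:
            self._current[2].append(data)


def parse_document(html_text, file_name):
    """Return (document title, list of text nodes)."""
    parser = HtmlSegmenter()
    parser.feed(html_text)
    parser.close()
    parser._flush()
    # Fall back to the file name (without extension) when no title is found.
    title = parser.title.strip() or Path(file_name).stem
    return title, parser.nodes


html_text = (
    "<html><head><title>Grid Weekly</title></head><body>"
    "<h1>1. Overview</h1>"
    '<div><p style="color:red">First   paragraph.</p><p>Second.</p></div>'
    "</body></html>"
)
title, nodes = parse_document(html_text, "grid_weekly_2021.html")
```

Each node keeps its tag and inline style so that later hierarchy analysis can use them.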

Preferably, the recommendation model specifically includes:

the preprocessing layer is used for preprocessing input data to calculate corresponding initialization weights and carrying out weighted summation according to the obtained initialization weights to obtain corresponding word vectors;

the forward LSTM layer is used for carrying out forward calculation on the corresponding word vector to obtain a first calculated value;

the backward LSTM layer is used for calculating the first calculation value backward to obtain a coding output value;

and the CRF layer is used for screening the received coding output value to obtain a globally optimal output sequence.

Preferably, obtaining the hot keywords specifically includes:

identifying collected target data, carrying out sequence labeling according to a preset labeling standard to obtain a labeling corpus, and converting the target data into a training set and a test set according to the labeling corpus;

inputting the training set and the test set into a preset recommendation model to obtain a word vector of each word, and firstly inputting the obtained word vector into the forward LSTM layer and the backward LSTM layer for bidirectional coding to obtain a coding vector of the whole sentence;

and the CRF layer processes the coding vector of the whole sentence as an input quantity, obtains a globally optimal output sequence through a preset screening rule, and outputs corresponding words as hot keywords according to the optimal output sequence.

Preferably, ranking the hot keywords according to the occurrence frequency specifically includes:

counting the relative word frequency of each keyword in a preset time period, and taking the relative word frequency as the occurrence frequency, wherein the relative word frequency is the product of two ratios: the proportion that each keyword's absolute word frequency represents of the keyword word frequency in the preset time period, and the proportion it represents of the total keyword word frequency in the preset time period;

and sequencing the corresponding hot keywords according to the sequence of the occurrence frequency from high to low to obtain the sequence of the hot keywords.

In another aspect, an industry hotspot recommendation system for implementing the above industry hotspot recommendation method is further provided, comprising:

the data acquisition module is used for acquiring target data of each target data source; preprocessing the target data according to a preset preprocessing rule;

the keyword extraction module is used for inputting the preprocessed target data into a pre-trained recommendation model for calculation to obtain hot keywords;

and the recommendation module is used for counting the occurrence frequency of the hot keywords in a preset time period, sequencing the hot keywords according to the occurrence frequency and outputting a sequencing result as an industry hotspot recommendation result.

Preferably, the data acquisition module is further configured to convert the acquired target data into a document in an HTML format and archive the document according to a preset classification category to obtain a text to be cleaned;

Preferably, the keyword extraction module is further configured to identify collected target data, perform sequence tagging according to a preset tagging standard to obtain a tagging corpus, and convert the target data into a training set and a test set according to the tagging corpus;

Preferably, the recommendation module is further configured to count the relative word frequency of each keyword in a preset time period and take the relative word frequency as the occurrence frequency, wherein the relative word frequency is the product of two ratios: the proportion that each keyword's absolute word frequency represents of the keyword word frequency in the preset time period, and the proportion it represents of the total keyword word frequency in the preset time period.

In summary, the embodiment of the invention has the following beneficial effects:

According to the industry hotspot recommendation method and system provided by the invention, continuously released, massive, heterogeneous, multi-level and multi-audience content data, such as internal reference materials, news, updates, work reports and academic journal papers, are collected as industry corpus data, and the pre-trained model continuously helps enterprise managers to comprehensively analyze, discover and mine the industry's key points, hotspots and development trends, from the grassroots level up to the regional level. By taking the time window of data collection and trend analysis as the unit, the method reflects changes in industry key points and hotspot trends in a timely manner and provides industry hotspot trend analysis reports over different time spans.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; other drawings obtained from them by those skilled in the art without inventive effort also fall within the scope of the present invention.

Fig. 1 is a main flow diagram of an industry hotspot recommendation method in an embodiment of the present invention.

Fig. 2 is a diagram illustrating a pre-trained recommendation model according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of an industry hotspot recommendation system in an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of an embodiment of an industry hotspot recommendation method provided by the present invention. In this embodiment, the method comprises the steps of:

collecting target data from each target data source and preprocessing the target data according to a preset preprocessing rule. That is, a distributed web crawler is deployed to capture professional content resources in the power industry, such as news and information, enterprise updates, announcements and journal articles, from the internal network and the external network respectively; the collected HTML article contents are filed into a queue of texts to be cleaned according to professional category, business topic, source and date; and the HTML texts are divided into hierarchical document trees.

In a specific embodiment, preprocessing the target data according to a preset preprocessing rule specifically includes: converting the collected target data into documents in HTML format and filing them according to preset classification categories to obtain a text to be cleaned; and converting the text to be cleaned into a document tree. Specifically, the title tag of the text to be cleaned is parsed as the document title, and if no title tag exists in the input text to be cleaned, the file name of the text to be cleaned is used as the document title; the whole document is segmented according to the document title to obtain a plurality of text nodes; and the parent-child relationship of each text node in the document tree structure is analyzed, and all text nodes are associated according to these parent-child relationships to obtain the document tree. It can be understood that the HTML content parsing service performs the following steps:

Document title parsing: according to the HTML tag information, title parsing is performed on the input file, and the parsed title node serves as the root node of the document tree. During parsing, the content of the HTML title tag is preferentially parsed as the document title; if no title tag exists in the input file, the file name of the input file is used as the document title.

Document segmentation: according to the HTML tag information, the chapter-granularity input file is cut into paragraph-granularity text nodes, which facilitates the subsequent analysis of the table of contents and the chapter hierarchy. During parsing, h tags, p tags and div tags are used as segmentation points; the input document is segmented into paragraph-granularity texts, and the HTML tag and style information is recorded on each text node.

Document hierarchy analysis: based on the segmented document nodes, the parent-child relationship of each text node in the document tree structure is analyzed, that is, the parent-child relationships between titles of different levels and between titles and body text, so as to construct the document tree. Hierarchy analysis involves identifying the chapter mode, chapter number and chapter level of each text node, and sequences storing global information are maintained during identification. The specific identification method is as follows:

(a) initializing two sequences of chapter modes and chapter serial numbers for storing global information; initializing a document tree;

(b) constructing a root node of the document tree according to the title node, and setting the hierarchy of the root node as 0;

(c) inserting the title node information into the global sequences, with the chapter mode 'title' and the chapter number 0;

(d) Setting a root node of the document tree as a reference tree node;

(e) parsing the chapter mode and the chapter number corresponding to the document node, according to the following rules:

Plain text: an explicit chapter information pattern appearing in the node's text is identified; the text pattern is parsed as the chapter mode and the number as the chapter number.

HTML tag: chapter information of a text node is identified through HTML tags such as ol, ul and li; the node's HTML XPath information excluding the li tag is used as the chapter mode, and the sequence number information in the li tag is used as the chapter number.

Chapter information matched in the table of contents: chapter information declared in the document's table of contents sometimes omits its explicit chapter pattern in the body. During parsing, text with no chapter result is matched against the table of contents, and if identical text content exists there, the chapter information from the table of contents is used as a supplement.

(f) identifying the chapter level of the document node from the chapter mode of the current node and the global chapter mode sequence: if the global chapter mode sequence contains the chapter mode of the node, the position of that chapter mode in the sequence is taken as the chapter level of the node; otherwise, the chapter level of the node is taken as the chapter mode sequence length + 1;

(g) updating the global chapter mode sequence and chapter number sequence according to the level information of the current node: the global chapter mode sequence and chapter number sequence are truncated to a length equal to the level of the current node;

(h) creating a tree node for the current text node: if the level of the tree node is greater than that of the reference tree node, the tree node is taken as a child of the reference tree node; if its level is equal to that of the reference tree node, the tree node is taken as a sibling of the reference tree node; if its level is smaller than that of the reference tree node, the reference tree node pointer is moved toward the root node until the level of the current tree node is greater than that of the reference tree node, and the current tree node is then taken as a child of the reference tree node;

(i) repeating steps (e) to (h) until all text nodes of the whole document have been traversed.
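Steps (a) to (i) can be sketched roughly as below. Several points are assumptions made for illustration: only the plain-text rule of step (e) is implemented, with three hypothetical regular expressions standing in for chapter modes; levels are zero-based, so an unseen chapter mode receives a level equal to the current sequence length (one deeper than the deepest tracked level); text without a chapter mode is attached as body text under the reference node; and the reference pointer is moved to each newly created heading node.

```python
import re

# Hypothetical chapter patterns: (chapter mode name, regex capturing the number).
CHAPTER_PATTERNS = [
    ("Chapter N", re.compile(r"^Chapter\s+(\d+)")),
    ("N.N", re.compile(r"^\d+\.(\d+)\s")),
    ("N.", re.compile(r"^(\d+)\.\s")),
]


class TreeNode:
    def __init__(self, text, level, parent=None):
        self.text, self.level, self.parent = text, level, parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def parse_chapter(text):
    """Step (e), plain-text rule only: return (chapter mode, chapter number)."""
    for mode, pattern in CHAPTER_PATTERNS:
        match = pattern.match(text)
        if match:
            return mode, int(match.group(1))
    return None, None


def build_document_tree(title, texts):
    modes, numbers = ["title"], [0]           # (a) and (c)
    root = TreeNode(title, level=0)           # (b)
    ref = root                                # (d)
    for text in texts:                        # (i)
        mode, number = parse_chapter(text)    # (e)
        if mode is None:
            # Assumption: body text becomes a child of the reference node.
            TreeNode(text, ref.level + 1, parent=ref)
            continue
        # (f) known mode -> its position; new mode -> one level deeper.
        level = modes.index(mode) if mode in modes else len(modes)
        # (g) truncate the global sequences, then record the current node.
        del modes[level:]
        del numbers[level:]
        modes.append(mode)
        numbers.append(number)
        # (h) climb toward the root until the reference is shallower.
        while ref.level >= level:
            ref = ref.parent
        ref = TreeNode(text, level, parent=ref)
    return root


def outline(node, depth=0):
    """Return the tree as indented lines, for inspection."""
    lines = ["  " * depth + node.text]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


texts = ["1. Overview", "Intro text.", "1.1 Scope", "Scope text.",
         "1.2 Terms", "2. Method", "Method text."]
tree = build_document_tree("Grid Weekly", texts)
```

The single `while` loop covers all three cases of step (h): a deeper node stays under the reference, an equal-level node becomes its sibling, and a shallower node climbs further toward the root.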

Further, the preprocessed target data are input into a pre-trained recommendation model for calculation to obtain hot keywords. That is, in view of the language and structural characteristics of articles in the power industry, the system adopts a joint entity-relation extraction method based on a pre-trained language model and performs joint entity-relation labeling by combining BIO labeling with category and relation labels, so as to extract the relevant information. The system adds relation labels on top of the traditional position and category labels. The position labels of the BIO labeling method are: B (Begin), the starting position of an entity; I (Inside), a position inside an entity; and O (Outside), a position outside any entity, that is, a position unrelated to an entity.
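A joint label of this kind might look as follows. The tag format `position-category-relation`, the category names and the relation role names are all hypothetical choices made for this sketch; the disclosure does not fix a concrete tag syntax.

```python
def joint_bio_tags(tokens, entities):
    """Build joint position + category + relation tags.

    entities: list of (start, end_exclusive, category, relation_role)
    token spans; anything not covered by a span is tagged "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, category, relation in entities:
        for i in range(start, end):
            position = "B" if i == start else "I"
            tags[i] = f"{position}-{category}-{relation}"
    return tags


# Hypothetical sentence and annotation.
tokens = ["Shenzhen", "Power", "Supply", "Bureau", "deploys", "energy", "storage"]
entities = [(0, 4, "ORG", "deploys_subj"), (5, 7, "TECH", "deploys_obj")]
tags = joint_bio_tags(tokens, entities)
```

Here the first entity is tagged `B-ORG-deploys_subj` followed by three `I-ORG-deploys_subj` tags, the verb is tagged `O`, and the second entity carries the matching object role, so a single tag sequence encodes both entities and the relation between them.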

In a specific embodiment, as shown in Fig. 2, the recommendation model specifically includes: a preprocessing layer, used for preprocessing input data to calculate corresponding initialization weights and performing weighted summation according to the obtained initialization weights to obtain corresponding word vectors; a forward LSTM layer, used for performing forward calculation on the corresponding word vectors to obtain a first calculated value; a backward LSTM layer, used for performing backward calculation on the first calculated value to obtain a coding output value; and a CRF layer, used for screening the received coding output value to obtain a globally optimal output sequence. A Seq2Seq model and a joint labeling strategy are adopted, combined with the self-attention mechanism of the pre-trained model, to convert sentences into labeling sequences and thereby realize joint extraction of entities and relations. First, word vectors in a high-dimensional space are mapped to a low-dimensional continuous space based on the pre-trained language model BERT, that is, sentences are converted into vector representations. Because a bidirectional long short-term memory network (Bi-LSTM) combined with a conditional random field performs well in open-domain entity relation extraction tasks, the input word vectors are encoded by a Bi-LSTM, which comprises two parallel LSTM layers: a forward LSTM layer, which computes in the forward direction, and a backward LSTM layer, which computes in the backward direction. The encoded output is then passed to the decoding layer, where the output of the coding layer is processed as the input of the decoding layer using a conditional random field (CRF) model, and finally softmax is used to obtain a globally optimal output sequence.

The attention mechanism assigns an attention range to each word, enabling the model to imitate the human way of thinking and filter information effectively, so that the generated labels attend not only to the global semantic coding vector but also to the relevant and important parts of the sequence; the next output is generated according to the region the model attends to and the attention weights. Finally, the decoded labeling sequence contains the role of each character in an entity and the category of the entity; entity relations are obtained by combining the labeling sequence, so the output labeling sequence constitutes the joint entity-relation extraction result. The whole process is the joint entity-relation extraction method based on the pre-trained model.
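To illustrate what a "globally optimal output sequence" means at the CRF layer, the sketch below uses Viterbi decoding, a standard technique for linear-chain CRFs. This is a swapped-in illustration: the text above mentions softmax and does not name Viterbi, and the emission and transition scores here are made-up numbers rather than the output of a trained Bi-LSTM.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions[t][j]: score of tag j at position t (e.g. from the encoder).
    transitions[i][j]: score of moving from tag i to tag j.
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for arriving at tag j.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, max(score)


TAGS = ["O", "B", "I"]
# Hypothetical scores: position 0 slightly prefers "O", position 1 prefers "I".
emissions = [[1.0, 0.8, 0.0],
             [0.2, 0.1, 1.0]]
# "O" -> "I" is heavily penalised because an entity cannot start with "I".
transitions = [[0.0, 0.0, -5.0],
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 0.5]]
path, best_score = viterbi(emissions, transitions)
decoded = [TAGS[i] for i in path]
```

Choosing the best tag independently at each position would give the invalid sequence O, I; scoring whole paths with transition scores yields B, I instead, which is the sense in which the sequence is optimal globally rather than per token.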

Specifically, obtaining the hot keywords includes: identifying the collected target data, performing sequence labeling according to a preset labeling standard to obtain a labeled corpus, and converting the target data into a training set and a test set according to the labeled corpus; inputting the training set and the test set into the preset recommendation model to obtain a word vector for each word, and first inputting the obtained word vectors into the forward LSTM layer and the backward LSTM layer for bidirectional coding to obtain a coding vector of the whole sentence; and processing, by the CRF layer, the coding vector of the whole sentence as input, obtaining a globally optimal output sequence through a preset screening rule, and outputting the corresponding words as hot keywords according to the optimal output sequence. It can be understood that the joint entity-relation extraction based on the pre-trained model mainly comprises three steps.

First, BIO labeling is performed on part of the collected professional texts to form a labeled corpus, which is converted into the form required by the model and divided into a training set and a test set.

Second, the preprocessed data are input into the embedding layer and mapped to a low-dimensional space through the BERT pre-trained language model, yielding the word vector representation {w1, w2, ..., wn} of the sentence. The obtained word vectors are input into the forward long short-term memory network and then the backward long short-term memory network for bidirectional coding: the forward LSTM encodes from w1 to wn, the backward LSTM encodes from wn to w1, and the coding vector of the whole sentence is output at the hidden layer.

Third, after encoding by the two layers of the bidirectional long short-term memory network, a fully connected layer is added. In this part, a conditional random field (CRF) model processes the output of the coding layer as the input of the decoding layer, and finally softmax is used to obtain a globally optimal output sequence. The entities and relations are then combined to obtain the required triples. This joint entity-relation extraction method strengthens the association between entities and improves extraction accuracy to a certain extent. Because sequence labeling is used for the entity relation extraction task, position information and relation categories are added to the entity category labels using the BIO labeling method; the extraction model is designed on the basis of a conditional random field and a bidirectional long short-term memory network, and the input sequence is encoded in both the forward and backward directions. The joint entity-relation labeling method treats named entity recognition and relation extraction as a single task, which reduces the accumulated errors of a pipeline model and ties entities and relations more closely together. To address the long-term dependency problem, labeling is performed with a deep learning method, so that a large number of complex features need not be constructed manually and model efficiency is improved.

In addition, the masking mechanism and self-attention mechanism of the pre-trained language model reduce the computational load of the decoding layer on high-dimensional input sequences: a subset of the input is selected in a structured way, which reduces the data dimension. At the same time, the attention mechanism assigns different attention to different sequences, which helps to discover and attend to the sequences relevant to the current output and improves output quality. The joint entity-relation extraction method ties entities and relations closely together and achieves high accuracy, which facilitates thorough mining of entity relations in the electric power domain. It should be noted that the training process of the model is the same as the specific process for obtaining the hot keywords; the longer the training time and the more training iterations, the higher the accuracy.
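Once a tag sequence has been decoded, the corresponding words can be read off as keyword candidates. The following sketch assumes simple `B-<label>` / `I-<label>` / `O` tags and a hypothetical helper name `extract_entities`; for character-level Chinese input the separator would be an empty string.

```python
def extract_entities(tokens, tags, separator=" "):
    """Group B-/I- tagged tokens into (entity text, label) pairs."""
    entities, current, current_label = [], [], None

    def close():
        # Emit the entity collected so far, if any.
        if current:
            entities.append((separator.join(current), current_label))

    for token, tag in zip(tokens, tags):
        position, _, label = tag.partition("-")
        if position == "B":
            close()
            current, current_label = [token], label
        elif position == "I" and current and label == current_label:
            current.append(token)
        else:
            # "O", or an "I" tag that does not continue an open entity.
            close()
            current, current_label = [], None
    close()
    return entities


# Hypothetical decoded sentence.
tokens = ["The", "smart", "grid", "uses", "energy", "storage"]
tags = ["O", "B-TECH", "I-TECH", "O", "B-TECH", "I-TECH"]
keywords = extract_entities(tokens, tags)
```

The resulting entity strings are what the later frequency statistics would count.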

Further, the occurrence frequency of the hot keywords in a preset time period is counted, the hot keywords are ranked by occurrence frequency, and the ranking result is output as the industry hotspot recommendation result. That is, to overcome the limitations of analyzing absolute keyword frequency alone, domain hotspots and trends are explored through multi-factor weighting and score ranking over keywords, regions, functional departments and the like. A two-dimensional time-keyword frequency matrix is constructed, keyword frequencies are processed with horizontal and vertical weighting, a relative word frequency model is designed, and a weighted composite score is calculated for each keyword to obtain a more effective keyword ranking. Based on the weighted ranking, high-frequency rising keywords, low-frequency rising keywords and burst-type keywords can be identified, which helps to mine research hotspots and analyze trends.

In a specific embodiment, the relative word frequency of each keyword in a preset time period is counted and taken as the occurrence frequency, wherein the relative word frequency is the product of two ratios: the proportion that each keyword's absolute word frequency represents of the keyword word frequency in the preset time period, and the proportion it represents of the total keyword word frequency in the preset time period. The corresponding hot keywords are then ranked from high to low occurrence frequency to obtain the hot keyword ranking. It can be understood that the number of occurrences of each keyword within the observation time window is counted in the collected content data, and a time-keyword matrix C(n, m) is constructed, where n is the total number of keywords and m is the number of days over which keywords are collected. The relative word frequency of each keyword within the time-series window is then calculated, namely the product of the proportion of the keyword's absolute word frequency in the keyword word frequency of the observation time window and its proportion in the total keyword word frequency over the time domain, and a relative word frequency matrix R(n, m) is constructed. The discrete distribution of keywords and keyword frequencies conforms to a logistic model; that is, the logistic model function accurately captures the discrete distribution pattern of knowledge units. A logistic model is therefore used to assign time weights to the relative word frequencies, so that the closer a relative word frequency is to the current time, the larger its weight, simulating the continuous change of keyword frequency over time. In other words, newer word frequency information contributes more to research hotspots and trend prediction, and the time function agrees with the distribution pattern actually observed.
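The matrices C(n, m) and R(n, m) and the logistic time weights might be computed as in the sketch below. The exact relative word frequency formula is ambiguous in the text, so the version here is only one possible reading and is an assumption: each count is multiplied by its share of that day's keyword occurrences and by its share of all keyword occurrences in the window. The logistic parameters (`steepness`, a midpoint at the centre of the window) are likewise hypothetical.

```python
import math


def count_matrix(daily_keywords, vocabulary):
    """C[i][j]: occurrences of keyword i on day j (days ordered oldest first)."""
    return [[day.count(kw) for day in daily_keywords] for kw in vocabulary]


def relative_frequency(C):
    """R[i][j] under the assumed reading: (share of day j) * (share of window)."""
    n_days = len(C[0])
    day_totals = [sum(row[j] for row in C) for j in range(n_days)]
    grand_total = sum(day_totals)
    return [
        [(c / day_totals[j]) * (c / grand_total) if day_totals[j] else 0.0
         for j, c in enumerate(row)]
        for row in C
    ]


def logistic_time_weights(n_days, steepness=1.0):
    """Weights that grow toward the most recent day (hypothetical parameters)."""
    midpoint = (n_days - 1) / 2
    return [1 / (1 + math.exp(-steepness * (j - midpoint))) for j in range(n_days)]


# Hypothetical extracted keywords for three consecutive days.
vocabulary = ["smart grid", "energy storage"]
daily_keywords = [
    ["smart grid", "smart grid", "energy storage"],
    ["smart grid", "energy storage"],
    ["energy storage", "energy storage", "energy storage"],
]
C = count_matrix(daily_keywords, vocabulary)
R = relative_frequency(C)
weights = logistic_time_weights(len(daily_keywords))
# Time-weighted relative frequency per keyword.
weighted = [sum(w * r for w, r in zip(weights, row)) for row in R]
```

In this toy window "energy storage" dominates the most recent day, so its time-weighted value exceeds that of "smart grid" even though both appear on every earlier day.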
Temporal weighting is performed using a logistic function, and a time weight coefficient is calculated. The time-series distribution of the keywords reflects the industry's key points within the observation time range, and the change of the keywords over time reflects industry hotspots. A change rate is calculated to reflect the overall trend of each keyword, and it is transformed with an exponential function with base e to balance the effect of each indicator on the composite value. A composite weighted score is then calculated from the change rate, the relative word frequency, the time weight coefficient, the propagation coefficient (calculated as the ratio of the forwarding count to the visit count), the region weight coefficient and the functional department weight coefficient. The final weighted scores are ranked to identify industry hotspots and predict development trends. Specifically, keywords ranked near the top by weighted score have a high word frequency or a prominent growth trend, which helps to reveal industry hotspots, predict development trends, and help enterprises focus accurately on key business directions. Rising high-frequency words float up in the ranking and falling high-frequency words sink, so that industry hotspots can be identified quickly. Rising medium-frequency words are strengthened and falling medium-frequency words are weakened, reflecting the relationship between ranking gain and burstiness. The composite scores of rising low-frequency words and of keywords with obvious sudden changes move up, so that low-frequency words with development potential can be identified. A user can also search by entering a keyword, and the system presents the popularity trend of that keyword.

Meanwhile, graph analysis can be performed on the hot keywords, and associated content is displayed from multiple angles such as degree of association, popularity, and popularity trend.
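How the listed factors combine into one score is not spelled out, so the sketch below simply multiplies them; that combination rule, the `1 + propagation` term, the class `KeywordStats` and all numbers are assumptions for illustration only. The change rate is passed through `exp`, as described, so that a falling trend shrinks the score and a rising trend amplifies it.

```python
import math
from dataclasses import dataclass


@dataclass
class KeywordStats:
    keyword: str
    relative_freq: float      # relative word frequency
    change_rate: float        # overall trend; negative means falling
    time_weight: float        # logistic time weight coefficient
    forwards: int             # forwarding count
    visits: int               # visit count
    region_weight: float = 1.0
    department_weight: float = 1.0


def composite_score(s):
    """Assumed combination: product of all factors."""
    propagation = s.forwards / s.visits if s.visits else 0.0
    trend = math.exp(s.change_rate)  # exponential transform of the change rate
    return (s.relative_freq * s.time_weight * trend * (1 + propagation)
            * s.region_weight * s.department_weight)


# Hypothetical statistics: a falling high-frequency word and a rising
# lower-frequency word.
stats = [
    KeywordStats("smart grid", 0.30, -0.5, 0.6, forwards=10, visits=100),
    KeywordStats("virtual power plant", 0.15, 1.0, 0.6, forwards=30, visits=100),
]
ranked = sorted(stats, key=composite_score, reverse=True)
ranked_keywords = [s.keyword for s in ranked]
```

With these made-up inputs the rising lower-frequency keyword outranks the falling high-frequency one, which mirrors the behaviour described above for rising and falling words.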

Fig. 3 is a schematic diagram of an embodiment of an industry hotspot recommendation system provided by the present invention. In this embodiment, the system includes:

the data acquisition module is used for acquiring target data of each target data source; preprocessing the target data according to a preset preprocessing rule; specifically, the data acquisition module is further configured to convert the acquired target data into a document in an HTML format and archive the document according to a preset classification category to obtain a text to be cleaned;

The keyword extraction module is used for inputting the preprocessed target data into a pre-trained recommendation model for calculation to obtain hot keywords; specifically, the keyword extraction module is further configured to identify the collected target data, perform sequence labeling according to a preset labeling standard to obtain a labeled corpus, and convert the target data into a training set and a test set according to the labeled corpus;

The recommendation module is used for counting the occurrence frequency of the hot keywords in a preset time period, ranking the hot keywords according to the occurrence frequency, and outputting the ranking result as an industry hotspot recommendation result. Specifically, the recommendation module is further configured to count the relative word frequency of each keyword in a preset time period and take the relative word frequency as the occurrence frequency, wherein the relative word frequency is the product of two ratios: the proportion that each keyword's absolute word frequency represents of the keyword word frequency in the preset time period, and the proportion it represents of the total keyword word frequency in the preset time period.

The above disclosure describes only preferred embodiments of the present invention and is not intended to limit the scope of the claims of the present invention.

Claims

1. An industry hotspot recommendation method is characterized by comprising the following steps:

2. The method of claim 1, wherein the preprocessing the target data according to a preset preprocessing rule specifically comprises:

and converting the text to be cleaned into a document tree.

3. The method of claim 2, wherein the converting the text to be cleaned into a document tree specifically comprises:

4. The method of claim 3, wherein the recommendation model specifically comprises:

5. The method of claim 4, wherein the obtaining of the trending keyword specifically comprises:

6. The method of claim 5, wherein said ranking the trending keywords according to the frequency of occurrence specifically comprises:

7. An industry hotspot recommendation system for implementing the method of any one of claims 1-6, comprising:

8. The system of claim 7, wherein the data acquisition module is further configured to convert the acquired target data into HTML-formatted documents and archive the documents according to preset classification categories to obtain texts to be cleaned;

9. The system according to claim 8, wherein the keyword extraction module is further configured to identify collected target data, perform sequence labeling according to a preset labeling standard to obtain a labeling corpus, and convert the target data into a training set and a test set according to the labeling corpus;

10. The system of claim 9, wherein the recommendation module is further configured to count the relative word frequency of each keyword in a preset time period and take the relative word frequency as the occurrence frequency, wherein the relative word frequency is the product of two ratios: the proportion that each keyword's absolute word frequency represents of the keyword word frequency in the preset time period, and the proportion it represents of the total keyword word frequency in the preset time period.