CN117951246B - New word discovery and application field prediction method and system for network technology

Info

Publication number: CN117951246B
Application number: CN202410351116.3A
Authority: CN
Inventors: 丁建伟; 李斌; 李航; 李欣泽; 陈周国; 王泽珺; 王鑫
Original assignee: CETC 30 Research Institute
Current assignee: CETC 30 Research Institute
Priority date: 2024-03-26
Filing date: 2024-03-26
Publication date: 2024-05-28
Anticipated expiration: 2044-03-26
Also published as: CN117951246A

Abstract

The invention discloses a method and system for network technology new word discovery and application field prediction. It relates to the field of natural language processing and aims to improve the accuracy of new word discovery and field prediction for network technology. The method comprises three parts. The first part preliminarily determines seed new words and their application fields by combining manual collection with similar words obtained from a GloVe word vector model. The second part collects and stores the latest updated scientific text data from an external knowledge base. The third part combines multiple NLP models to determine network technology new words and predict their application fields. The invention mines the characteristics of network technology new words in depth, takes full account of the meaning the words express in their sentences, and improves the recall rate of new words while maintaining accuracy. It also uses the maximum common substring to merge the application fields of new words, which further improves the accuracy of application field prediction.

Description

New word discovery and application field prediction method and system for network technology

Technical Field

The invention relates to the field of natural language processing, in particular to a network technology new word discovery and application field prediction system based on multiple models.

Background

With the rapid development of the internet, the network security situation is becoming increasingly complex. Discovering newly coined network technology terms (referred to as network technology new words) in time and predicting their application fields makes it possible to give early warning of network attacks, illegal transactions and the like, and so to maintain the security of the network environment.

New network technology words appear frequently, particularly in the current era of big data and large models. Finding them manually is time-consuming and labor-intensive and has a high miss rate, so most people learn of a word only after it is already in wide use. Machine learning, natural language processing and related techniques are now widely applied to network technology new word discovery. One popular class of schemes judges new words by word frequency, for example the network new word discovery method and system based on statistics and similarity disclosed in Chinese patent publication CN113033183A. However, such schemes cannot detect a word that has just appeared and whose frequency is still low, and the resulting information delay hampers early warning of network attacks, illegal transactions and information hazards. Another class of schemes clusters vocabulary by semantic similarity, for example the network new word discovery method based on sentence semantic similarity disclosed in Chinese patent publication CN117574886A. However, semantic similarity comparison depends heavily on the richness of the word segmentation rules, the context and the standard corpus. The ways in which new words can be found are limited, and the features of the network new words themselves are easily overlooked, so the new words finally found tend to deviate from their actual application fields.

Disclosure of Invention

The object of the invention is, in view of all or part of the problems above, to provide a network technology new word discovery and application field prediction system that overcomes the limitations of the prior art in discovering network technology new words, mines the deep features of those words, and discovers network technology new words and predicts their application fields more accurately.

The technical scheme adopted by the invention is as follows:

A network technology new word discovery and application field prediction method comprises the following steps:

Determining a first number of seed new words; expanding the number of seed new words to a second number from the collected corpus using a GloVe word vector model; labeling the application field of each seed new word and storing the seed new words;

Collecting latest updated scientific text data from an external knowledge base;

Updating a first keyword weight dictionary of the KeyBERT model and a second keyword weight dictionary of the LAC model by using each piece of scientific text data;

Sequentially comparing each key value pair in the first keyword weight dictionary and the second keyword weight dictionary, and updating a third keyword weight dictionary with the corresponding key and the corresponding value when the edit distance of the two keys compared is within a first threshold;

selecting a key with a value reaching a first condition from the third keyword weight dictionary as a final network technology new word;

carrying out semantic analysis on scientific text data corresponding to the final network technology new word so as to predict the application field of the final network technology new word;

and associating the application field of the final network technology new word with the application field of the stored network technology new word.

Further, the associating the application field of the final network technology new word with the application field of the stored network technology new word includes:

and determining whether the final network technology new word and the stored network technology new word are related to the same application field according to the gap between the final network technology new word and the stored network technology new word.

Further, the maximum common substring length between the final network technology new word and the stored network technology new word is calculated using a maximum common substring algorithm, and the final network technology new word and the stored network technology new word are associated to the same application field when the maximum common substring length is greater than zero.

Further, the collecting the latest updated scientific text data from the external knowledge base includes:

collecting latest updated scientific text information and scientific image information from an external knowledge base;

and extracting text information in the scientific image information, and combining the text information with the scientific text information to obtain scientific text data.

Further, the updating the first keyword weight dictionary of the KeyBERT model and the second keyword weight dictionary of the LAC model with each piece of the scientific text data includes:

initializing the first keyword weight dictionary of the KeyBERT model and the second keyword weight dictionary of the LAC model;

for each piece of scientific text data, extracting unigram keywords, bigram keywords and their corresponding weights using the KeyBERT model, and updating the first keyword weight dictionary with the keywords as keys and the sum of the weights as values; and extracting unigram keywords, bigram keywords and their corresponding weights using the LAC model, and updating the second keyword weight dictionary with the keywords as keys and the sum of the weights as values.
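As an illustration of the dictionary update described above, the following Python sketch accumulates keyword weights across texts. The extractor callables (`toy_unigrams`, `toy_bigrams`) are hypothetical stand-ins for the KeyBERT and LAC extraction steps, and no real KeyBERT or LAC API is used. Summing the unigram and bigram weights per keyword across all texts is one reading of "the sum of the weights", not a confirmed detail of the method.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# An extractor maps one text to (keyword, weight) pairs.
# Hypothetical interface: the real models would be wrapped to match it.
Extractor = Callable[[str], List[Tuple[str, float]]]


def update_weight_dict(weights: Dict[str, float], text: str,
                       unigram: Extractor, bigram: Extractor) -> None:
    """Add the unigram and bigram weights found in `text` to `weights`."""
    for keyword, weight in unigram(text) + bigram(text):
        weights[keyword] = weights.get(keyword, 0.0) + weight


def build_weight_dict(texts: Iterable[str],
                      unigram: Extractor, bigram: Extractor) -> Dict[str, float]:
    """Initialize an empty dictionary and update it with every text."""
    weights: Dict[str, float] = {}
    for text in texts:
        update_weight_dict(weights, text, unigram, bigram)
    return weights


def toy_unigrams(text: str) -> List[Tuple[str, float]]:
    """Toy extractor: every whitespace-separated word gets weight 1.0."""
    return [(word, 1.0) for word in text.split()]


def toy_bigrams(text: str) -> List[Tuple[str, float]]:
    """Toy extractor: every adjacent word pair gets weight 0.5."""
    words = text.split()
    return [(f"{a} {b}", 0.5) for a, b in zip(words, words[1:])]


k_weights = build_weight_dict(["zero trust", "zero day"], toy_unigrams, toy_bigrams)
```

The same function would be called once with the KeyBERT-backed extractors and once with the LAC-backed extractors to obtain the first and second dictionaries.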

Further, the sequentially comparing each key value pair in the first keyword weight dictionary and the second keyword weight dictionary, and updating the third keyword weight dictionary with the corresponding key and the corresponding value when the edit distance of the two compared keys is within the first threshold, includes:

sequentially extracting a first key value pair {key1: value1} from the first keyword weight dictionary and a second key value pair {key2: value2} from the second keyword weight dictionary, and, when key1 = key2 or the edit distance between key1 and key2 is not more than 1, updating the third keyword weight dictionary as follows:

updating the third keyword weight dictionary with key2 as the key and value1 + lg(value2) as the value, or with key1 as the key and value2 + lg(value1) as the value.
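The comparison and merge rule above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: every key of the first dictionary is compared with every key of the second, the first of the two update options (key2 as key, value1 + lg(value2) as value) is used, lg is taken as the base-10 logarithm, and pairs with a non-positive value2 are skipped because the logarithm is undefined there. The function names are illustrative.

```python
import math
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def merge_weight_dicts(first: Dict[str, float], second: Dict[str, float],
                       max_distance: int = 1) -> Dict[str, float]:
    """Build the third dictionary from keys that match within max_distance."""
    third: Dict[str, float] = {}
    for key1, value1 in first.items():
        for key2, value2 in second.items():
            if value2 <= 0:
                continue  # assumption: lg is only applied to positive weights
            if levenshtein(key1, key2) <= max_distance:
                # Option used here: key2 as key, value1 + lg(value2) as value.
                third[key2] = value1 + math.log10(value2)
    return third


k_weights = {"deepfake": 2.0, "botnet": 1.0}
l_weights = {"deepfakes": 10.0, "phishing": 3.0}
c_weights = merge_weight_dicts(k_weights, l_weights)
```

In this toy input only "deepfake" and "deepfakes" are within edit distance 1, so the third dictionary holds a single entry. If several keys of the first dictionary match the same key2, the last match overwrites the earlier ones in this sketch.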

Further, selecting a key with a value reaching a first condition from the third keyword weight dictionary as a final network technology new word, including:

sorting each key value pair in the third keyword weight dictionary according to the order of the weights from small to large;

selecting the leading third number of key value pairs;

normalizing the weight of each selected key value pair using the Min-Max normalization algorithm;

and screening out keys with standardized weights reaching a second threshold value as final network technology new words.
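The selection step above can be sketched as a ranking, a Min-Max normalization x' = (x - min) / (max - min) over the selected weights, and a threshold test. The sketch below assumes that the ranking keeps the highest-weight candidates. The sort direction is exposed as the hypothetical parameter `descending` because the text describes the order differently. The values 10 and 0.5 follow the example values given in the detailed description.

```python
from typing import Dict, List


def select_new_words(weights: Dict[str, float], top_k: int = 10,
                     threshold: float = 0.5,
                     descending: bool = True) -> List[str]:
    """Rank candidates, min-max normalize the top_k weights, keep those >= threshold."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1],
                    reverse=descending)[:top_k]
    if not ranked:
        return []
    values = [value for _, value in ranked]
    low, high = min(values), max(values)
    span = high - low
    selected = []
    for word, value in ranked:
        # Assumption: when all weights are equal, every word normalizes to 1.0.
        normalized = 1.0 if span == 0 else (value - low) / span
        if normalized >= threshold:
            selected.append(word)
    return selected


c_weights = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0}
final_words = select_new_words(c_weights, top_k=3)
```

With `top_k=3` the candidates are e, d and c with weights 5, 4 and 3, which normalize to 1.0, 0.5 and 0.0, so e and d pass the 0.5 threshold.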

Further, the determining a first number of seed new words and expanding the number of seed new words to a second number from the collected corpus using a GloVe word vector model includes:

Setting a first number of seed new words;

Repeating the following steps until the seed new word reaches a second number:

collecting corpus, wherein each corpus at least comprises one new seed word;

training the GloVe word vector model with the collected corpus;

obtaining similar words of the seed new words using the trained GloVe word vector model;

And calculating the similarity between the similar words and the new seed words, and screening the obtained similar words according to a set third threshold value to expand the number of the new seed words by the screened similar words.

The invention also provides a network technology new word discovery and application field prediction system, which comprises a processor, wherein the processor is configured to execute the network technology new word discovery and application field prediction method.

The invention also provides another network technology new word discovery and application field prediction system, which comprises a computer readable storage medium, wherein the computer readable storage medium is stored with a computer program, and the computer program is operated to execute the network technology new word discovery and application field prediction method.

In summary, due to the adoption of the technical scheme, the beneficial effects of the invention are as follows:

The invention provides a method for automatically discovering network technology new words that combines the GloVe word vector model, the KeyBERT model, the LAC model, edit distance and other techniques. It mines the characteristics of network technology new words in depth, takes full account of the meaning the words express in their sentences, splits the vocabulary along multiple dimensions, and improves the recall rate of new words while maintaining accuracy. In addition, the invention uses the maximum common substring to merge the application fields of new words, which avoids redundant application field categories and improves semantic understanding. By predicting the application fields of the new technologies that the network technology new words correspond to, it can provide strong support for network security early warning and for responding to network attacks, illegal transactions and the improper spread of technology.

Drawings

The invention will now be described by way of example and with reference to the accompanying drawings in which:

Fig. 1 shows one embodiment of seed new word discovery and expansion.

Fig. 2 shows one embodiment of scientific text data collection.

Fig. 3 shows one embodiment of updating the first, second and third keyword weight dictionaries.

Fig. 4 shows one embodiment of final network technology new word discovery and application field determination.

Fig. 5 is an architecture diagram of the initial network technology new word iterative expansion module.

Fig. 6 is an architecture diagram of the science and technology website data acquisition module.

Fig. 7 is an architecture diagram of the network technology new word discovery and application field prediction module.

Detailed Description

All of the features disclosed in this specification, or all of the steps in a method or process disclosed, may be combined in any combination, except for mutually exclusive features and/or steps.

Any feature disclosed in this specification (including any accompanying claims, abstract) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. That is, each feature is one example only of a generic series of equivalent or similar features, unless expressly stated otherwise.

Example 1

A network technology new word discovery and application field prediction method mines keyword features with multiple models, so that network technology new words can be discovered more accurately. The method comprises three parts. The first part preliminarily determines seed new words and their application fields by combining manual collection with similar words obtained from a GloVe word vector model. The second part collects the latest updated scientific text data from external knowledge bases such as scientific network news and paper data, WeChat technology official accounts, microblogs and knowledge-sharing technology boards. The third part fuses multiple NLP (Natural Language Processing) models to determine network technology new words and predict their application fields.

1. First part

In this part, a first number of seed new words is determined; the number of seed new words is expanded to a second number from the collected corpus using a GloVe word vector model; and the application field of each seed new word is labeled.

This part is done mainly by hand. New technologies that have been popular on the major network platforms over the last month (or another time period) are collected and summarized together with their application fields, and a first number (5 in this example) of representative network technology new words are selected as seed new words. The GloVe word vector model is then used repeatedly to find words similar to the seed new words in the collected corpus, and the most similar network technology new words (for example, 15) are selected to expand the seed new words to the set second number. The application field of each seed new word is set by manual labeling. Note that the expanded set of the second number may include both the 5 initially determined new words and the 15 expanded similar words, or may retain only the 15 expanded network technology new words as seed new words. Finally, the seed new words and their corresponding application fields are stored in a MySQL database table.

As shown in fig. 1, the first portion, in some embodiments, comprises the steps of:

Step 1: determining initial seed new words: manually setting the 5 currently most popular network technology new words as initial seed new words;

Step 2: manually obtaining text corpora for the seed new words obtained so far, wherein each corpus contains at least one seed new word;

Step 3: training the GloVe word vector model with the corpora obtained in step 2;

Step 4: obtaining similar words of the seed new words using the GloVe word vector model trained in step 3;

Step 5: calculating the similarity between the similar words obtained in step 4 and the seed new words, screening automatically against a preset third threshold, selecting the similar words whose similarity reaches the third threshold to expand the number of seed new words, and confirming the result manually;

Step 6: if the number of seed new words after expansion is not 20, returning to step 2; otherwise, manually determining the application fields of the seed new words;

Step 7: merging the seed new words with their corresponding application fields and storing them in a MySQL database table.
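One round of the threshold-based expansion in steps 4 and 5 can be sketched as follows. The sketch assumes that word vectors have already been trained and are available as a plain dictionary (GloVe training itself is not shown), and that similarity is cosine similarity, which the text does not specify. The function names, the toy vectors and the threshold value are illustrative only.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_seed_words(seeds: List[str], vectors: Dict[str, Sequence[float]],
                      threshold: float, target_count: int) -> List[str]:
    """Add the most similar words above `threshold` until target_count is reached."""
    expanded = list(seeds)
    candidates = []
    for word, vector in vectors.items():
        if word in expanded:
            continue
        # Score a candidate by its best similarity to any seed word.
        best = max((cosine_similarity(vector, vectors[s])
                    for s in seeds if s in vectors), default=0.0)
        if best >= threshold:
            candidates.append((best, word))
    for _, word in sorted(candidates, reverse=True):
        if len(expanded) >= target_count:
            break
        expanded.append(word)
    return expanded


toy_vectors = {
    "ransomware": [1.0, 0.0],
    "cryptolocker": [0.9, 0.1],
    "banana": [0.0, 1.0],
}
seed_words = expand_seed_words(["ransomware"], toy_vectors,
                               threshold=0.8, target_count=20)
```

If the expanded list is still shorter than the target, the procedure would return to corpus collection and retraining, as step 6 describes. Manual confirmation of the candidates is outside the scope of this sketch.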

Packaging the first part into the prediction system yields an initial network technology new word iterative expansion module. That is, a network technology new word discovery and application field prediction system can be designed that includes this module, the module being configured to execute the steps of the first part. The module architecture is shown in Fig. 5.

2. Second part

This part collects the latest updated scientific text data from the external knowledge base.

This part uses general data acquisition techniques to collect structured data from external knowledge bases such as scientific network news and paper data, WeChat technology official accounts, microblogs and knowledge-sharing technology boards, mainly the titles and contents of the corresponding articles. For content that contains pictures or videos, this part also downloads the pictures and videos and stores them in MinIO, extracts the text they contain using image and video text extraction algorithms, merges that text with the article titles and contents, and stores the result in the Hive database.

As shown in fig. 2, in some embodiments, the portion includes the steps of:

Step 1: text data acquisition: collecting recently updated scientific text information (for example, updated within the last month) from an external knowledge base (for example, a science and technology website);

Step 2: for websites that carry scientific image information such as pictures and videos, also collecting that scientific image information;

Step 3: extracting the text information in the pictures or videos using image processing techniques, and storing the pictures and videos in MinIO;

Step 4: merging the text information from step 1 and step 3 and storing it in Hive.

The second part is packaged into a prediction system, so that a scientific and technical website data acquisition module can be obtained, wherein the module is configured to execute the steps of the second part, and the module architecture is shown in fig. 6.

3. Third part

This part updates a first keyword weight dictionary of the KeyBERT model and a second keyword weight dictionary of the LAC model with each piece of scientific text data; sequentially compares the key value pairs in the first keyword weight dictionary and the second keyword weight dictionary, and updates the third keyword weight dictionary with the corresponding key and value when the edit distance between the two compared keys is within a first threshold; selects from the third keyword weight dictionary the keys whose values meet a first condition as final network technology new words; and performs semantic analysis on the scientific text data corresponding to each final network technology new word to predict its application field. In addition, the application field of the final network technology new word is associated with the application fields of the stored network technology new words; for example, whether the final network technology new word and a stored network technology new word are associated to the same application field can be determined according to the gap between the two words. The stored network technology new words include the seed new words from the first part, the network technology new words already stored in the database table in the past, and the network technology new words added to the database table over time. For example, if the corpus currently used to find network technology new words covers the last month, the network technology new words found in earlier corpora, together with their application fields, were already stored in the database table more than a month ago.

For the data stored in Hive, this part extracts network technology new words and application fields item by item from the data newly collected over the past day (or another time length). Specifically, it extracts candidate network technology new words and their weights using the KeyBERT model and the LAC model, and combines the weights of identical words, or of words whose edit distance is not more than 1, to determine the new words extracted by both models and their weights. It then sorts the words by weight in reverse order, selects the first third number (10 in this example) of words, calculates the normalized weight of each of these words using the Min-Max normalization algorithm, and finally selects the words whose normalized weight is not less than 0.5 as final network technology new words. For application field prediction, the system gathers the scientific text data that contains the determined network technology new words and obtains the related application fields by semantic analysis. In addition, the third part may use a maximum common substring algorithm to obtain the maximum common substring length between the final network technology new word and a stored network technology new word. If that length is greater than 0, the two words are considered similar and their application fields may be associated to a unified application field, for example by merging them (that is, taking the union of the two application fields) into an application field common to both words. Finally, the final network technology new words and their application fields are stored in a MySQL database table.

As shown in fig. 3, in some embodiments, the third portion includes the steps of:

Step 1: scientific text data of one day (or other time length) is read from Hive, and the scientific text data is stored into a character string list L to be extracted according to the sequence of each record;

Step 2: initializing the KeyBERT model and the LAC model, together with the keyword weight dictionaries k_subject (the first keyword weight dictionary), l_subject (the second keyword weight dictionary) and c_subject (the combined, third keyword weight dictionary);

Step 3: sequentially reading each piece of scientific text data in the to-be-extracted string list L from step 1, and for each piece of scientific text data:

1) extracting unigram keywords, bigram keywords and their corresponding weights using the KeyBERT model, and forming key value pairs, with the keywords as keys and the sum of the weight values of the two types of keywords as values, to update the keyword weight dictionary k_subject;

2) similarly, extracting unigram keywords, bigram keywords and their corresponding weights using the LAC model, and forming key value pairs, with the keywords as keys and the sum of the weight values of the two types of keywords as values, to update the keyword weight dictionary l_subject.

Step 4: sequentially extracting a key value pair {key1: value1} from k_subject and a key value pair {key2: value2} from l_subject, where key1 and key2 are keys and value1 and value2 are values, comparing the keys of the two extracted key value pairs, and judging whether key1 is equal to key2 or whether the Levenshtein edit distance between them is not more than 1;

Step 5: if the condition in step 4 is satisfied, forming a key value pair with key2 as the key and value1 + lg(value2) as the value, or with key1 as the key and value2 + lg(value1) as the value, to update the keyword weight dictionary c_subject;

Step 6: returning to step 4 until all keys of the keyword weight dictionaries k_subject and l_subject have been compared, at which point the update of the keyword weight dictionary c_subject is complete.

As shown in Fig. 4, the third part further includes the following steps:

Step 7: sorting the key value pairs in the keyword weight dictionary c_subject updated in step 6 in reverse order of weight (namely from small to large), selecting the first third number of key value pairs, and normalizing their weight values to the interval [0, 1] using the Min-Max normalization algorithm;

Step 8: selecting the keys whose normalized weight value from step 7 is greater than or equal to 0.5 (the second threshold) as final network technology new words;

Step 9: taking each final network technology new word generated in step 8 (denoted kd1) as a reference, finding its corresponding application field in the to-be-extracted string list L from step 1 by semantic analysis;

Step 10: reading the stored network technology new words (denoted kd2) and their application fields from the MySQL database table;

Step 11: comparing the network technology new words kd1 and kd2 from step 9 and step 10 pairwise, and obtaining their maximum common substring length len using the maximum common substring algorithm find_lcs_substr;

Step 12: if the maximum common substring length len calculated in step 11 is greater than 0, updating the application fields of both kd1 and kd2 to the union of their application fields, and writing the updated data to the MySQL database table;

Step 13: if the maximum common substring length len calculated in step 11 is equal to 0, adding the final network technology new word and its application field directly to the MySQL database table;

Step 14: returning to step 11 until all final network technology new words and their application fields have been added to the MySQL database table.
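Steps 10 to 14 can be sketched in Python as below. Only the name find_lcs_substr comes from the text. Its body here is a standard dynamic-programming computation of the longest common contiguous substring length and is not necessarily the implementation used. The MySQL table is replaced by an in-memory dictionary mapping each word to a set of application fields, which is an assumption made to keep the sketch self-contained.

```python
from typing import Dict, Set


def find_lcs_substr(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def merge_application_fields(new_words: Dict[str, Set[str]],
                             stored: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return the stored table after adding new_words and merging related fields."""
    table = {word: set(fields) for word, fields in stored.items()}
    for kd1, fields1 in new_words.items():
        related = [kd2 for kd2 in table if find_lcs_substr(kd1, kd2) > 0]
        merged = set(fields1)
        for kd2 in related:
            merged |= table[kd2]        # union of the application fields
        for kd2 in related:
            table[kd2] = set(merged)    # update the stored word as well
        table[kd1] = merged             # add (or update) the new word
    return table


stored_table = {"blockchain": {"finance"}}
found = {"sidechain": {"payments"}, "xyz": {"misc"}}
updated_table = merge_application_fields(found, stored_table)
```

Because the condition is a length greater than 0, a single shared character is enough to link two words in this sketch, exactly as the step is worded. Here "sidechain" and "blockchain" share "chain" and end up with the same field set, while "xyz" shares no character with either and is added unchanged.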

The third part is packaged into a system, so as to obtain a network technology new word discovery and application field prediction module, wherein the module is configured to execute the steps of the third part, and the module architecture is shown in fig. 7.

And finally, integrating the three modules (namely an initial network technology new word iteration expansion module, a scientific and technical website data acquisition module and a network technology new word discovery and application field prediction module) together to obtain the network technology new word discovery and application field prediction system.

Example 2

This embodiment describes the design of a network technology new word discovery and application field prediction system. A processor configured in the system performs the method of Example 1. Alternatively, the system is configured with a computer readable storage medium that stores a computer program which, when run, executes the method of Example 1. In the latter case, a processor must also be provided to read the computer program from the computer readable storage medium and execute it.

The invention is not limited to the specific embodiments described above. The invention extends to any novel one, or any novel combination, of the features disclosed in this specification, as well as to any novel one, or any novel combination, of the steps of the method or process disclosed.

Claims

1. A network technology new word discovery and application field prediction method, characterized by comprising the following steps:

Collecting latest updated scientific text data from an external knowledge base;

Updating the first keyword weight dictionary of the KeyBERT model and the second keyword weight dictionary of the LAC model with each piece of the scientific text data, comprising: initializing the first keyword weight dictionary of the KeyBERT model and the second keyword weight dictionary of the LAC model; for each piece of scientific text data, extracting unigram keywords, bigram keywords and their corresponding weights using the KeyBERT model, and updating the first keyword weight dictionary with the keywords as keys and the sum of the weights as values; and extracting unigram keywords, bigram keywords and their corresponding weights using the LAC model, and updating the second keyword weight dictionary with the keywords as keys and the sum of the weights as values;

Sequentially comparing each key value pair in the first keyword weight dictionary and the second keyword weight dictionary, and updating a third keyword weight dictionary with the corresponding key and the corresponding value when the edit distance of the two compared keys is within a first threshold, comprising: sequentially extracting a first key value pair {key1: value1} from the first keyword weight dictionary and a second key value pair {key2: value2} from the second keyword weight dictionary, and, when key1 = key2 or the edit distance between key1 and key2 is not more than 1, updating the third keyword weight dictionary as follows: updating the third keyword weight dictionary with key2 as the key and value1 + lg(value2) as the value, or with key1 as the key and value2 + lg(value1) as the value;

2. The network technology new word discovery and application field prediction method according to claim 1, wherein the associating the application field of the final network technology new word with the application field of the stored network technology new word includes:

3. The network technology new word discovery and application field prediction method according to claim 2, wherein a maximum common substring length between the final network technology new word and the stored network technology new word is calculated by using a maximum substring algorithm, and when the maximum common substring length is greater than zero, the final network technology new word and the stored network technology new word are associated to the same application field.

4. The network technology new word discovery and application field prediction method according to claim 1, wherein the collecting the latest updated scientific text data from the external knowledge base comprises:

5. The network technology new word discovery and application field prediction method according to claim 1, wherein selecting a key having a value reaching a first condition from the third keyword weight dictionary as a final network technology new word includes:

selecting the leading third number of key value pairs;

6. The network technology new word discovery and application field prediction method of claim 1, wherein the determining a first number of seed new words and expanding the number of seed new words to a second number from the collected corpus using a GloVe word vector model comprises:

Setting a first number of seed new words;

Repeating the following steps until the seed new word reaches a second number:

collecting corpus, wherein each corpus at least comprises one new seed word;

training the GloVe word vector model with the collected corpus;

7. A network technology new word discovery and application field prediction system, comprising a processor configured to perform the network technology new word discovery and application field prediction method according to any one of claims 1 to 6.

8. A network technology new word discovery and application field prediction system comprising a computer readable storage medium having a computer program stored therein, wherein the computer program is operated to perform the network technology new word discovery and application field prediction method according to any one of claims 1 to 6.