CN113128209B - Method and device for generating word stock - Google Patents


Info

Publication number
CN113128209B
Authority
CN
China
Prior art keywords
word
risk
keyword
target
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110437047.4A
Other languages
Chinese (zh)
Other versions
CN113128209A (en
Inventor
杨德将
李原
郝萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110437047.4A
Publication of CN113128209A
Application granted
Publication of CN113128209B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Finance (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Accounting & Taxation (AREA)
  • Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a method and apparatus for generating a word stock, an electronic device, a computer-readable storage medium and a computer program product. It relates to the field of computers and, more specifically, to the technical field of data processing. The specific implementation scheme is: acquiring an initial risk word; expanding the initial risk word to obtain an expanded risk word; determining keyword information based on the initial risk word and the expanded risk word; and generating a target word stock based on each keyword in the keyword information. This implementation can improve the accuracy of the word stock.

Description

Method and device for generating word stock
Technical Field
The present disclosure relates to the field of computers, and more particularly to the field of data processing technology, and in particular to a method and apparatus for generating word stock, an electronic device, a computer readable storage medium and a computer program product.
Background
At present, risk problems such as insurance fraud occur frequently in the insurance industry. To reduce these risks, risk control is necessary.
In the risk control process, a risk word stock is usually set up so that policies carrying risk can be screened in advance, and risk warnings issued, by matching the risk words in the word stock against each policy. In practice, existing risk word stocks are usually built up manually through accumulated experience and hand entry, so their accuracy is poor.
Disclosure of Invention
The present disclosure provides a method and apparatus for generating word stock, an electronic device, a computer readable storage medium and a computer program product.
According to a first aspect, there is provided a method for generating a word stock, comprising: acquiring an initial risk word; expanding the initial risk word to obtain an expanded risk word; determining keyword information based on the initial risk word and the expanded risk word; and generating a target word stock based on each keyword in the keyword information.
According to a second aspect, there is provided a method for risk detection, comprising: generating a target word stock based on the method for generating the word stock according to any one of the above; and performing risk detection on the target object based on the target word stock.
According to a third aspect, there is provided an apparatus for generating a word stock, comprising: a risk word acquisition unit configured to acquire an initial risk word; a risk word expansion unit configured to expand the initial risk word to obtain an expanded risk word; an information determination unit configured to determine keyword information based on the initial risk word and the expanded risk word; and a word stock generation unit configured to generate a target word stock based on each keyword in the keyword information.
According to a fourth aspect, there is provided a risk detection apparatus, comprising the apparatus for generating a word stock and a risk detection unit configured to perform risk detection on a target object based on the target word stock generated by the apparatus for generating a word stock.
According to a fifth aspect, there is provided an electronic device comprising: one or more processors; a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as in any of the above.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform any of the methods above.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as in any of the above.
According to the technology of the present disclosure, a method for generating a word stock is provided. The initial risk words are expanded to obtain expanded risk words, which broadens the coverage of the risk words. Keyword information is then determined based on the initial risk words and the expanded risk words, which extracts further information from the risk words. Finally, a target word stock is generated based on each keyword in the keyword information. The resulting target word stock reflects every keyword in the initial and expanded risk words, contains more information, and more accurate information, than a manually built word stock, and therefore improves word stock accuracy.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a method for generating a word stock in accordance with the present disclosure;
FIG. 3 is a schematic illustration of one application scenario of a method for generating a word stock in accordance with the present disclosure;
FIG. 4 is a flow chart of another embodiment of a method for generating a word stock in accordance with the present disclosure;
FIG. 5 is a flow chart of one embodiment of a risk detection method according to the present disclosure;
FIG. 6 is a schematic diagram of the structure of one embodiment of an apparatus for generating a word stock in accordance with the present disclosure;
FIG. 7 is a schematic diagram of the structure of one embodiment of a risk detection apparatus according to the present disclosure;
FIG. 8 is a block diagram of an electronic device for implementing the methods of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is an exemplary system architecture diagram according to a first embodiment of the present disclosure, which illustrates an exemplary system architecture 100 to which embodiments of the method for generating a word stock of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. The terminal devices 101, 102, 103 may be mobile phones, computers, tablet computers, etc., and in the terminal devices 101, 102, 103, insurance application software may be installed, where the insurance application software may implement insurance risk control, for example, identify users with risk and output corresponding prompts.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices including, but not limited to, televisions, smartphones, tablets, e-book readers, vehicle-mounted computers, laptop computers and desktop computers. When they are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No particular limitation is imposed here.
The server 105 may be a server providing various services. For example, it may acquire initial risk words from the terminal devices 101, 102, 103, expand the initial risk words to obtain expanded risk words, determine keyword information based on the initial risk words and the expanded risk words, and generate a target word stock based on each keyword in the keyword information. Afterwards, the server 105 may receive an instruction, sent by the terminal devices 101, 102, 103, to identify whether a target user presents a risk, determine the risk status based on how the target word stock matches the target user, and return the risk status to the terminal devices 101, 102, 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No particular limitation is imposed here.
It should be noted that, the method for generating a word stock provided by the embodiment of the present disclosure may be performed by the terminal devices 101, 102, 103, or may be performed by the server 105. Accordingly, the means for generating the word stock may be provided in the terminal devices 101, 102, 103 or in the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for generating a word stock in accordance with the present disclosure is shown. The method for generating a word stock of the present embodiment includes the following steps:
Step 201, an initial risk word is obtained.
In this embodiment, when a word stock needs to be generated, the execution body (such as the server 105 or the terminal devices 101, 102, 103 in fig. 1) may first acquire the initial risk word. The initial risk word may be provided in advance by a business party, or may be obtained by analyzing historical risk data; this embodiment does not limit the source. There may be one or more initial risk words; this embodiment does not limit the number. In the risk control stage of the insurance industry, the initial risk word may be a word that indicates a risk for a specified insurance, such as a disease type that is not applicable to the specified insurance; that disease type is then treated as a risk word. In the risk control stage of the advertising industry, the initial risk word may be a word that is not allowed to appear in advertisements. If the initial risk words come from a word packet provided in advance by the business party, the execution body may perform semantic recognition on the word packet to organize it, for example by mapping each initial risk word to a corresponding risk category and writing the mapping between risk categories and initial risk words to a JSON (JavaScript Object Notation) file.
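By way of non-limiting illustration, the organizing of a word packet into a JSON mapping might be sketched as follows. The sketch assumes the word packet arrives as (risk word, risk category) pairs; the function name `build_category_map`, the category labels and the sample words are hypothetical and do not come from the disclosure.

```python
import json
from collections import defaultdict

def build_category_map(word_packet):
    """Group initial risk words under their risk category."""
    mapping = defaultdict(list)
    for word, category in word_packet:
        if word not in mapping[category]:  # skip duplicate entries
            mapping[category].append(word)
    return dict(mapping)

# Hypothetical word packet supplied by a business party.
packet = [
    ("disease A", "health"),
    ("claim fraud", "fraud"),
    ("disease A", "health"),
]
category_map = build_category_map(packet)

# Serialize the mapping as JSON text; ensure_ascii=False keeps
# non-ASCII risk words (for example Chinese) readable in the file.
json_text = json.dumps(category_map, ensure_ascii=False, indent=2)
```

Keying the file by risk category is one possible layout; keying by risk word would serve equally well if lookups go from word to category.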
Step 202, expanding the initial risk word to obtain an expanded risk word.
In this embodiment, after acquiring the initial risk word, the execution body may expand it to obtain an expanded risk word. An expanded risk word is a risk word that has an association relation with the initial risk word. To determine expanded risk words, the execution body may, based on semantic association with the initial risk word, determine risk words whose edit distance from the initial risk word is smaller than a threshold, and use them as expanded risk words. Alternatively, a risk word expansion database may be set up in advance, and the execution body may look up the risk words corresponding to the initial risk word directly in that database and use them as expanded risk words. The risk word expansion database may be built by analyzing historical risk data.
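By way of non-limiting illustration, expansion by an edit-distance threshold might be sketched as follows. The sketch assumes the Levenshtein distance as the edit distance, which the disclosure does not specify; the function names, the threshold values and the candidate list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def expand(initial_word, candidates, threshold):
    """Keep candidates whose distance to the initial word is below threshold."""
    return [c for c in candidates
            if c != initial_word and edit_distance(initial_word, c) < threshold]

candidates = ["mild disease A", "moderate disease A"]
close = expand("disease A", candidates, threshold=6)
wider = expand("disease A", candidates, threshold=10)
```

With a threshold of 6 only "mild disease A" (distance 5) is kept; raising the threshold to 10 also admits "moderate disease A" (distance 9).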
Step 203, determining keyword information based on the initial risk word and the expanded risk word.
In this embodiment, the execution body may determine the key phrases in the initial risk words and the expanded risk words, and then split each key phrase into individual keywords, which together form the keyword information. The key phrases may be determined by preset rules. For example, "disease A" may be determined as the key phrase for "mild disease A", "moderate disease A" and "severe disease A". Each word in a key phrase is a keyword. Optionally, determining the keyword information based on the initial risk word and the expanded risk word may include: inputting the initial risk words and the expanded risk words into a preset key phrase determination model to obtain a plurality of corresponding key phrases; obtaining a plurality of keywords from the words in those key phrases; and determining the keyword information based on those keywords. Further optionally, the execution body may sort the words in the key phrases to obtain the keyword information.
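By way of non-limiting illustration, splitting key phrases (the multi-word keywords extracted from the risk words) into individual keywords might be sketched as follows. The sketch assumes whitespace-separated English text; Chinese text would need character-level splitting or a word segmenter instead. All function names and sample phrases are hypothetical.

```python
def build_keyword_info(key_phrases):
    """Split each key phrase into its keywords, keeping their order."""
    return [phrase.split() for phrase in key_phrases]

def unique_keywords(keyword_info):
    """Flatten to distinct keywords, preserving first-seen order."""
    seen = {}
    for words in keyword_info:
        for word in words:
            seen.setdefault(word, None)  # dicts keep insertion order
    return list(seen)

info = build_keyword_info(["disease A", "work injury fraud", "fraud"])
keywords = unique_keywords(info)
```

Keeping one keyword list per key phrase preserves the position of each keyword inside its phrase, which later ordering steps can rely on.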
Step 204, generating a target word stock based on each keyword in the keyword information.
In this embodiment, the keywords in the keyword information may be ordered according to semantics, so that adjacent, related keywords together form a word. The keyword information may also include word segmentation markers, which divide the keywords in the keyword information into phrases. The execution body may identify the word segmentation markers in the keyword information and divide the keywords at those markers to obtain a plurality of phrases, which form the target word stock.
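By way of non-limiting illustration, dividing an ordered keyword sequence at word segmentation markers might be sketched as follows. The sketch assumes the marker is stored inline as a sentinel token; the sentinel value `<seg>` and the function name are hypothetical.

```python
SEG = "<seg>"  # hypothetical word segmentation marker

def divide_by_markers(sequence, marker=SEG):
    """Cut a keyword sequence into phrases at each marker."""
    phrases, current = [], []
    for token in sequence:
        if token == marker:
            if current:  # ignore empty segments between markers
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:  # trailing keywords without a closing marker
        phrases.append(" ".join(current))
    return phrases

sequence = ["disease", "A", SEG, "work", "injury", "fraud", SEG]
word_stock = divide_by_markers(sequence)
```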
For example, in an insurance risk control scenario, the execution body may analyze historical risk data, such as data of users who presented a risk in the past, to obtain the initial risk words corresponding to a specified insurance. The execution body may then expand the initial risk words. Specifically, it may analyze the historical at-risk user data in advance to determine the frequency of risk occurrence corresponding to each initial risk word, and then apply semantic association to a designated number of initial risk words, taken in descending order of that frequency, to obtain the expanded risk words. The semantic association may be implemented by existing techniques, which are not described here. Next, the execution body may determine the key phrases in the initial and expanded risk words, split the key phrases into keywords, and arrange and combine the keywords in different ways to obtain the keyword information, which includes a keyword sequence for each type of arrangement. The execution body may traverse each type of keyword sequence and, based on semantic analysis, add a word segmentation marker at the position of a designated keyword in each sequence; optionally, the marker is placed at the position of the last keyword that can complete a word. Finally, the execution body may divide the keywords in each keyword sequence at the word segmentation markers to form a plurality of words and generate the target word stock.
With continued reference to fig. 3, a schematic diagram of one application scenario of a method for generating a word stock according to the present disclosure is shown. In the application scenario of fig. 3, the execution body is used for risk control in the insurance industry. The execution body may first obtain the initial risk words 301 corresponding to a specified insurance, which may include "disease A" and "claim fraud". The execution body may then expand the initial risk words 301 to obtain expanded risk words 302, which may include "mild disease A" and "moderate disease A", obtained by expanding "disease A", and "insurance claim fraud" and "work-injury claim fraud", obtained by expanding "claim fraud". Next, the execution body may determine the key phrases in the initial risk words 301 and the expanded risk words 302, such as "disease A", "claim fraud", "work-injury claim fraud" and "insurance claim fraud". The execution body then splits the key phrases into individual words to obtain keyword information 303, which includes "disease", "A", "claim", "fraud", "work", "injury" and "insurance". By arranging and combining these keywords, a target word stock 304 may be obtained, which may include "disease A", "claim fraud", "disease A claim fraud", "disease A insurance claim fraud", "disease A work-injury claim fraud", and the like.
According to the method for generating a word stock provided by the above embodiment of the present disclosure, the initial risk words are expanded to obtain expanded risk words, which broadens the coverage of the risk words. Keyword information is then determined based on the initial risk words and the expanded risk words, which extracts further information from the risk words. Finally, a target word stock is generated based on each keyword in the keyword information. The resulting target word stock reflects every keyword in the initial and expanded risk words, contains more information, and more accurate information, than a manually built word stock, and therefore improves word stock accuracy.
With continued reference to fig. 4, a flow 400 of another embodiment of a method for generating a word stock according to the present disclosure is shown. As shown in fig. 4, the method for generating a word stock of the present embodiment may include the following steps:
Step 401, obtaining an initial risk word.
In this embodiment, for details of step 401, refer to the description of step 201, which is not repeated here.
Step 402, determining the edit distance and/or semantic similarity between each candidate expansion word in a preset candidate expansion word stock and the initial risk word.
In this embodiment, the preset candidate expansion word stock is a word stock determined in advance, for example by manual setting or data crawling, and may include a large number of risk words. After acquiring the initial risk words, the execution body may determine an initial classification of the initial risk words, screen a plurality of candidate expansion words out of the preset candidate expansion word stock, and determine the edit distance and/or semantic similarity between those candidate expansion words and the initial risk words. The edit distance is the number of edit operations required to convert one word into the other, and the semantic similarity is the degree to which the meanings of two words are alike, as obtained by semantic analysis. The shorter the edit distance and the higher the semantic similarity, the stronger the association between the candidate expansion word and the initial risk word.
In some optional implementations of this embodiment, the preset candidate expansion word stock is determined by: acquiring historical risk information; and determining the preset candidate expansion word stock based on the frequency of occurrence of each risk word in the historical risk information.
In this implementation, the historical risk information may be information about risks that occurred in the past, such as a corpus from users who presented a risk in the past. The execution body may segment the historical risk information into words to obtain the risk words that appear in it. A preset number of the risk words that occur most frequently in the historical risk information are then added to a word stock to obtain the preset candidate expansion word stock. For example, the three most frequent risk words in the historical risk information are added to the word stock to obtain the preset candidate expansion word stock.
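By way of non-limiting illustration, selecting the most frequent risk words from historical risk information might be sketched as follows. The sketch assumes the historical records have already been segmented into lists of risk words; the function name and the sample records are hypothetical.

```python
from collections import Counter

def build_candidate_stock(segmented_records, top_n=3):
    """Return the top_n most frequent risk words across all records."""
    counts = Counter()
    for record in segmented_records:
        counts.update(record)  # count every occurrence of every word
    return [word for word, _ in counts.most_common(top_n)]

# Hypothetical, already-segmented historical risk records.
records = [
    ["fraud", "injury"],
    ["fraud", "disease"],
    ["fraud", "injury", "relapse"],
    ["disease", "fraud", "injury"],
]
candidate_stock = build_candidate_stock(records, top_n=3)
```

Here "fraud" occurs four times, "injury" three times, "disease" twice and "relapse" once, so the first three are kept.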
Step 403, determining expanded risk words among the candidate expansion words based on the edit distance and/or the semantic similarity.
In this embodiment, the shorter the edit distance and the higher the semantic similarity between a candidate expansion word and the initial risk word, the stronger the association between them. Therefore, a preset number of candidate expansion words may be selected as expanded risk words in order of edit distance from short to long and/or semantic similarity from high to low.
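By way of non-limiting illustration, the selection of a preset number of candidates might be sketched as follows. The sketch assumes the edit distances and semantic similarities have already been computed by some other component, and it assumes one particular way of combining the two criteria (edit distance first, similarity as tie-breaker), which the disclosure leaves open. The type `Scored`, the function name and all scores are hypothetical.

```python
from typing import NamedTuple

class Scored(NamedTuple):
    word: str
    edit_distance: int
    similarity: float  # assumed to lie in [0, 1], from any semantic model

def select_expanded_words(scored, top_n):
    """Rank by edit distance ascending, then similarity descending."""
    ranked = sorted(scored, key=lambda s: (s.edit_distance, -s.similarity))
    return [s.word for s in ranked[:top_n]]

# Hypothetical scores relative to the initial risk word "disease A".
scored = [
    Scored("moderate disease A", 9, 0.90),
    Scored("mild disease A", 5, 0.92),
    Scored("severe disease A", 7, 0.88),
    Scored("unrelated term", 12, 0.10),
]
expanded = select_expanded_words(scored, top_n=2)
```

A weighted sum of the two scores would be an equally valid combination; the lexicographic key is used here only because it needs no tuning.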
Step 404, determining keyword information based on the initial risk word and the expanded risk word.
In this embodiment, for details of step 404, refer to the description of step 203, which is not repeated here.
Step 405, determining a keyword set based on each keyword in the keyword information and a preset keyword sequence.
In this embodiment, since each keyword in the keyword information is obtained by splitting the key phrases of the initial risk words and the expanded risk words, the preset keyword order may be the order in which the keywords appear within their key phrase. Note that an initial or expanded risk word may contain two or more key phrases; for example, "work-injury claim fraud" contains "work injury" and "claim fraud". Each key phrase can be split into its own keywords, and when the keywords belonging to different key phrases are ordered, different types of ordering may be included, for example "work-injury claim fraud" with its two key phrases placed in either order. The execution body may sort the keywords in the keyword information according to the preset keyword order to obtain a keyword set.
In some optional implementations of the present embodiment, determining the keyword set based on each keyword in the keyword information and the preset keyword order includes: traversing each keyword in the keyword information against an initial dictionary tree according to the preset keyword order; for each keyword that is not already stored, storing the keyword in the initial dictionary tree in association with the next keyword, according to the preset keyword order, to obtain a target dictionary tree; and determining the keyword set based on the keywords in the target dictionary tree.
In this implementation, the execution body may generate the target word stock by building a dictionary tree. A dictionary tree is a tree-shaped storage structure in which every node except the root node holds exactly one character. Each keyword in the keyword information may be stored in a node of the dictionary tree other than the root node. Specifically, the execution body may traverse the keywords against the initial dictionary tree according to the preset keyword order, that is, according to each of the multiple types of keyword ordering. The initial dictionary tree is the dictionary tree into which data is to be written, and the execution body can read it directly from memory. During traversal, if the current keyword is not already stored, the current keyword and the next keyword are stored in association as a key-value pair. Here, a keyword that is not already stored is a keyword in the keyword information that is not yet present in the initial dictionary tree, and the next keyword is determined by the preset keyword order. If the current keyword is already stored, the value corresponding to the current keyword is determined, and traversal continues to the next keyword from that value until traversal ends, yielding the target dictionary tree. Here, an already-stored keyword is a keyword in the keyword information that is already present in the initial dictionary tree. This process stores every keyword in the keyword information in the target dictionary tree. Optionally, when the current keyword is determined to be the last word of a key phrase, a word segmentation marker may be added at the storage location of the current keyword. Further optionally, after the word segmentation marker is added at the storage location of the current keyword, a corresponding risk category may also be added there.
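By way of non-limiting illustration, the dictionary tree might be sketched as follows. The sketch stores one keyword per node, uses a dict of children as the key-value association between a keyword and the keywords that may follow it, and represents the word segmentation marker as a boolean flag on the node. The class name `TrieNode`, the function names and the category labels are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}         # next keyword -> child node
        self.is_word_end = False   # word segmentation marker
        self.risk_category = None  # optional category stored at the marker

def insert(root, keywords, risk_category=None):
    """Walk the keywords in order, storing any that are not yet present."""
    node = root
    for keyword in keywords:
        if keyword not in node.children:  # not stored in advance
            node.children[keyword] = TrieNode()
        node = node.children[keyword]     # follow the stored value
    node.is_word_end = True
    node.risk_category = risk_category

def collect(root):
    """Return every marked keyword path as (word, risk category)."""
    results, stack = [], [(root, [])]
    while stack:
        node, path = stack.pop()
        if node.is_word_end:
            results.append((" ".join(path), node.risk_category))
        for keyword, child in node.children.items():
            stack.append((child, path + [keyword]))
    return results

root = TrieNode()
insert(root, ["disease", "A"], "health")
insert(root, ["disease", "A", "fraud"], "fraud")
words = sorted(collect(root))
```

Because both inserted sequences share the prefix "disease", "A", the second insertion reuses the existing nodes and adds only one new node.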
Step 406, determining word segmentation locations in the keyword set.
In this embodiment, the execution body adds a word segmentation marker at the position of the last word of each key phrase. By identifying the word segmentation markers, the execution body can take their positions as the word segmentation positions in the keyword set. The word segmentation positions are used to divide the keywords into different words. Besides adding a marker at the position of the last word of a key phrase, determining the word segmentation positions in the keyword set may optionally further include: identifying the keyword combinations in the keyword set, judging whether each keyword combination is a normal word, and if so, adding word segmentation markers based on that keyword combination. For example, a key phrase corresponds to a plurality of keywords, and one of those keywords may serve as an abbreviation that carries the meaning of the whole key phrase. That keyword can then be treated as a keyword combination: word segmentation markers are added immediately before and after it, and the word segmentation positions are determined from those markers.
Step 407, dividing the keyword set into at least one target word based on the word segmentation position.
In this embodiment, since the word segmentation positions describe where the keywords are divided, the keyword set can be split at each word segmentation position, and the keywords between any two word segmentation positions can be combined, in order, into one target word, yielding at least one target word.
Step 408, generating a target word stock based on the at least one target word.
In this embodiment, at least one target word and a risk category corresponding to each target word may be stored correspondingly, so as to generate a target word stock.
In some optional implementations of the present embodiment, generating the target word stock based on the at least one target word includes: for each target word in at least one target word, determining a risk category corresponding to the target word; and generating a target word library based on at least one target word and the risk category corresponding to each target word.
In this implementation, the risk category corresponding to a target word may be set in advance at the position of the word segmentation marker. The risk category may be determined by semantic classification: the semantics of the target word are analyzed and the category that matches the target word is determined. When the target word stock is generated, each target word may be stored together with its risk category. For example, the risk categories may include fraud risk and the like, and when the target word stock is generated, each target word corresponding to fraud risk may be stored in association with the fraud risk category. With a target word stock built this way, the risk category corresponding to a risk word can be found quickly during risk detection, which makes it easier to output targeted risk warnings.
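By way of non-limiting illustration, storing target words together with their risk categories might be sketched as follows. The function `classify` is a hypothetical rule-based stand-in for the semantic classification mentioned above, and the category "health risk" is likewise hypothetical; only "fraud risk" is named in the text.

```python
from collections import defaultdict

def classify(word):
    """Hypothetical stand-in for semantic classification."""
    return "fraud risk" if "fraud" in word else "health risk"

def build_target_stock(target_words, classify_fn=classify):
    """Store each target word together with its risk category."""
    return {word: classify_fn(word) for word in target_words}

def group_by_category(stock):
    """Invert the stock so words can be looked up by risk category."""
    groups = defaultdict(list)
    for word, category in stock.items():
        groups[category].append(word)
    return dict(groups)

stock = build_target_stock(["disease A", "claim fraud", "disease A claim fraud"])
by_category = group_by_category(stock)
```

The forward mapping answers "which category does this matched word belong to" during detection, while the inverted mapping lists all words for a given category.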
Step 409, storing a snapshot of the target thesaurus.
In this embodiment, the execution body may store a snapshot of the target thesaurus for loading the target thesaurus based on the snapshot.
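A snapshot can be as simple as a serialized copy of the word stock on disk. The sketch below assumes a JSON file and a word stock shaped as a mapping from risk category to a list of target words; the file format and the helper names `save_snapshot` and `load_snapshot` are illustrative choices, not details given in the disclosure.

```python
import json
from pathlib import Path
from typing import Dict, List


def save_snapshot(lexicon: Dict[str, List[str]], path: Path) -> None:
    """Write the word stock to disk as UTF-8 JSON (non-ASCII words kept readable)."""
    path.write_text(json.dumps(lexicon, ensure_ascii=False, sort_keys=True), encoding="utf-8")


def load_snapshot(path: Path) -> Dict[str, List[str]]:
    """Load a word stock previously written by save_snapshot."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Loading the snapshot restores the word stock without repeating the expansion and segmentation steps.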
Step 410, acquiring a risk positive sample set and/or a risk negative sample set.
In this embodiment, the risk positive sample set is a set of samples that carry risk, and the risk negative sample set is a set of samples that carry no risk. For example, the risk positive sample set may be a set of risky users, and the risk negative sample set may be a set of non-risky users.
Step 411, updating the target word stock based on the risk positive sample set and/or the risk negative sample set.
In this embodiment, for the risk positive sample set, the risk positive samples may be traversed in turn, and each word in the target word stock is matched against the current risk positive sample to classify the sample as risky or not risky. If the sample is classified as risky, the traversal continues. If the sample is classified as not risky, the target risk words in that risk positive sample are added to the target word stock, for example in descending order of their occurrence frequency in the sample. For the risk negative sample set, a preset number of risk words, taken in descending order of occurrence frequency, is determined in each risk negative sample; it is then judged whether each of these risk words is in the target word stock, and if so, the word is deleted. The target word stock is updated in this way.
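The supervised update of steps 410 and 411 can be sketched as follows. Several details are assumptions for illustration only: each sample is represented as a list of tokens, a sample counts as "risky" when it shares at least one token with the word stock, and the same preset number `n` of most frequent tokens is used for both positive and negative samples. The names `top_words` and `update_lexicon` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Set


def top_words(tokens: Sequence[str], n: int) -> List[str]:
    """Return the n most frequent tokens, highest frequency first."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def update_lexicon(
    lexicon: Set[str],
    positives: Iterable[Sequence[str]],
    negatives: Iterable[Sequence[str]],
    n: int = 2,
) -> Set[str]:
    """Return an updated copy of the word stock.

    Positive sample not matched by any word: add its most frequent tokens.
    Negative sample: remove its most frequent tokens from the word stock.
    """
    updated = set(lexicon)
    for tokens in positives:
        if not updated.intersection(tokens):  # risky sample the word stock missed
            updated.update(top_words(tokens, n))
    for tokens in negatives:
        updated.difference_update(top_words(tokens, n))
    return updated


if __name__ == "__main__":
    result = update_lexicon(
        {"scam"},
        positives=[["scam", "call"], ["refund", "refund", "link"]],
        negatives=[["scam", "scam", "hello"]],
        n=1,
    )
    print(result)
```

In the usage example, the second positive sample is missed by the initial word stock, so its most frequent token is added, while the most frequent token of the negative sample is removed.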
The method for generating a word stock provided by this embodiment of the present disclosure may further sort the keywords according to a preset keyword sequence to obtain a keyword set, and determine the word segmentation positions based on the segmentation positions corresponding to the keywords and on the semantic association of keyword combinations. The keyword set is then divided at the word segmentation positions to obtain at least one target word. A target word stock generated from these target words can cover more types of words, such as abbreviations of keywords, combinations of keywords at different positions, and the individual keywords themselves, which improves the coverage of the target word stock. In addition, the target words and their risk categories can be stored together, so that target words can be determined directly from a risk category and risk words can be determined and looked up more efficiently. Edit distance and semantic similarity can also serve as the basis for determining expanded risk words, so that more accurate expanded risk words are obtained. The occurrence frequency of each risk word in the historical risk information can further be used as a basis for determining expanded risk words, which makes the expanded risk words more accurate and the word stock content richer. Supervised updating of the target word stock can be achieved based on the risk positive sample set and/or the risk negative sample set, which keeps the target word stock more up to date and more accurate. Finally, storing a snapshot of the target word stock allows the target word stock to be updated by reading the snapshot directly, which makes updating the word stock more convenient.
With continued reference to fig. 5, a flow 500 of one embodiment of a risk detection method according to the present disclosure is shown. As shown in fig. 5, the risk detection method of the present embodiment may include the steps of:
Step 501, generating a target word stock.
In this embodiment, the execution subject may generate the target thesaurus based on the above-described method for generating a thesaurus.
Step 502, performing risk detection on the target object based on the target word stock.
In this embodiment, the target object may be data in various forms such as text or images, for example, the policy text of a target user. By processing the target object, for example through text analysis or image recognition, a plurality of keywords contained in the target object are identified, and the probability that the target object is risky can be determined by matching these keywords against the target word stock. The higher the degree of matching between the keywords contained in the target object and the words in the target word stock, the higher the probability of risk, and risk early warning is achieved on this basis.
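As a rough illustration of step 502, the degree of matching can be approximated by the share of a target object's keywords that appear in the target word stock. This scoring rule, the function name `risk_score`, and the idea of comparing the score with a threshold are assumptions made for the sketch; the disclosure only states that a higher degree of matching indicates a higher probability of risk.

```python
from typing import Collection, Sequence


def risk_score(keywords: Sequence[str], lexicon: Collection[str]) -> float:
    """Fraction of the object's keywords found in the word stock (0.0 if none given)."""
    if not keywords:
        return 0.0
    hits = sum(1 for keyword in keywords if keyword in lexicon)
    return hits / len(keywords)


if __name__ == "__main__":
    score = risk_score(["refund", "link", "hello", "today"], {"refund", "link"})
    # A caller could raise an early warning when the score exceeds a chosen threshold.
    print(score)
```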
With further reference to fig. 6, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of an apparatus for generating a word stock, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various servers.
As shown in fig. 6, the apparatus 600 for generating a word stock of the present embodiment includes: a risk word acquisition unit 601, a risk word expansion unit 602, an information determination unit 603, and a thesaurus generation unit 604.
The risk word acquisition unit 601 is configured to acquire an initial risk word.
The risk word expansion unit 602 is configured to expand the initial risk word to obtain an expanded risk word.
The information determination unit 603 is configured to determine keyword information based on the initial risk word and the expanded risk word.
The thesaurus generation unit 604 is configured to generate a target thesaurus based on each keyword in the keyword information.
In some optional implementations of the present embodiment, the thesaurus generation unit 604 is further configured to: determining a keyword set based on each keyword in the keyword information and a preset keyword sequence; determining word segmentation positions in the keyword set; dividing the keyword set into at least one target word based on the word segmentation position; a target word stock is generated based on the at least one target word.
In some optional implementations of the present embodiment, the thesaurus generation unit 604 is further configured to: traversing each keyword in the keyword information in the initial dictionary tree according to a preset keyword sequence; for each keyword which is not stored in advance, storing the keyword and the next keyword of the keyword in an initial dictionary tree in an associated mode according to a preset keyword sequence to obtain a target dictionary tree; a set of keywords is determined based on the individual keywords in the target dictionary tree.
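The dictionary-tree construction can be sketched with nested dictionaries, where each node is a keyword and its children are the keywords that may follow it in the preset keyword sequence. The representation and the helper names `insert_sequence` and `paths` are illustrative assumptions; in particular, this sketch only enumerates root-to-leaf paths and does not mark sequences that end at an inner node.

```python
from typing import Dict, Iterator, Sequence, Tuple

Trie = Dict[str, "Trie"]


def insert_sequence(trie: Trie, keywords: Sequence[str]) -> None:
    """Walk the tree along the keyword sequence, creating only the missing nodes."""
    node = trie
    for keyword in keywords:
        # setdefault keeps an existing child and links a new one to its predecessor.
        node = node.setdefault(keyword, {})


def paths(trie: Trie, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield every root-to-leaf keyword sequence stored in the tree."""
    if not trie and prefix:
        yield prefix
    for keyword, child in trie.items():
        yield from paths(child, prefix + (keyword,))


if __name__ == "__main__":
    tree: Trie = {}
    insert_sequence(tree, ["bank", "card", "fraud"])
    insert_sequence(tree, ["bank", "transfer"])
    print(list(paths(tree)))
```

Because shared prefixes are stored once, keywords that are already present are simply traversed, and only keywords not stored in advance create new nodes.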
In some optional implementations of the present embodiment, the thesaurus generation unit 604 is further configured to: for each target word in at least one target word, determining a risk category corresponding to the target word; and generating a target word library based on at least one target word and the risk category corresponding to each target word.
In some optional implementations of the present embodiment, the risk word extension unit 602 is further configured to: in a preset candidate expanded word library, determining editing distance and/or semantic similarity between each candidate expanded word and an initial risk word; based on the edit distance and/or semantic similarity, an expanded risk word is determined among the candidate expanded words.
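The edit-distance part of this expansion can be sketched with the standard Levenshtein distance. The threshold `max_distance` and the function names are assumptions for illustration, and semantic similarity is deliberately left out because the disclosure does not name a specific similarity model.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (char_a != char_b),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def expand(initial: str, candidates: Iterable[str], max_distance: int = 1) -> List[str]:
    """Keep candidates within max_distance edits of the initial risk word."""
    return [c for c in candidates if c != initial and edit_distance(initial, c) <= max_distance]


if __name__ == "__main__":
    print(expand("fraud", ["frauds", "fraud", "freud", "loan"]))
```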
In some optional implementations of the present embodiment, the risk word extension unit 602 is further configured to: acquiring historical risk information; and determining a preset candidate expansion word library based on the occurrence frequency of each risk word in the historical risk information.
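A frequency-based candidate library could be built along the following lines. The sketch assumes that the historical risk information is available as a list of records, each already reduced to a list of risk words, and that a word enters the library once it occurs at least `min_count` times; both the data shape and the threshold rule are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, Sequence, Set


def build_candidate_library(history: Iterable[Sequence[str]], min_count: int = 2) -> Set[str]:
    """Collect risk words that occur at least min_count times across all records."""
    counts = Counter(word for record in history for word in record)
    return {word for word, count in counts.items() if count >= min_count}


if __name__ == "__main__":
    records = [["fraud", "refund"], ["refund", "link"], ["refund", "fraud"]]
    print(sorted(build_candidate_library(records)))
```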
In some optional implementations of this embodiment, the apparatus further includes: a set acquisition unit configured to acquire a risk positive sample set and/or a risk negative sample set; and the word stock updating unit is configured to update the target word stock based on the risk positive sample set and/or the risk negative sample set.
In some optional implementations of this embodiment, the apparatus further includes: a snapshot storage unit configured to store a snapshot of the target word stock.
It should be understood that the units 601 to 604 recited in the apparatus 600 for generating word stock correspond to the respective steps in the method described with reference to fig. 2. Thus, the operations and features described above with respect to the method of generating a thesaurus are equally applicable to the apparatus 600 and the elements contained therein, and are not described in detail herein.
With continued reference to fig. 7, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a risk detection apparatus, which corresponds to the method embodiment shown in fig. 5, and which is particularly applicable to various servers. The risk detection apparatus 700 of the present embodiment includes the above-described apparatus 600 for generating a word stock and a risk detection unit 701; wherein,
the risk detection unit 701 is configured to perform risk detection on the target object based on the target word stock generated by the apparatus 600 for generating a word stock.
In this embodiment, the risk detection unit 701 corresponds to step 502, and the operations and features described above for step 502 are equally applicable to the risk detection unit 701, and are not described herein.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product. The electronic device for generating the word stock comprises: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to implement the method for generating word stock or the risk detection method. The readable storage medium stores computer instructions for causing a computer to perform the above-described method for generating a thesaurus or risk detection method. The computer program product comprises a computer program which, when executed by a processor, implements the above-described method for generating a lexicon or risk detection method.
Fig. 8 illustrates a block diagram of an electronic device 800 for implementing a method for generating a word stock in accordance with an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a processor 801 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the processor 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 801 performs the various methods and processes described above, such as the method for generating a word stock. For example, in some embodiments, the method for generating a word stock may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the processor 801, one or more steps of the method for generating a word stock described above may be performed. Alternatively, in other embodiments, the processor 801 may be configured to perform the method for generating a word stock by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of the flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed herein are achieved, and no limitation is imposed in this respect.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. A method for generating a lexicon, comprising:
acquiring an initial risk word;
expanding the initial risk words to obtain expanded risk words;
determining keyword information based on the initial risk word and the expanded risk word; the keyword information comprises a plurality of keywords and word segmentation marks, wherein the word segmentation marks are used for dividing each keyword in the keyword information into corresponding phrases;
generating a target word stock based on each keyword in the keyword information;
wherein the generating a target word stock based on each keyword in the keyword information includes:
determining a keyword set based on each keyword in the keyword information and a preset keyword sequence;
determining the word segmentation position from the position of the word segmentation mark in the keyword set;
dividing the keyword set into at least one target word based on the word segmentation position;
and generating the target word stock based on the at least one target word.
2. The method of claim 1, wherein the determining a set of keywords based on each keyword in the keyword information and a preset keyword order comprises:
traversing each keyword in the keyword information in an initial dictionary tree according to the preset keyword sequence;
for each keyword which is not stored in advance, storing the keyword and the next keyword of the keyword in the initial dictionary tree in an associated mode according to the preset keyword sequence to obtain a target dictionary tree;
the set of keywords is determined based on the individual keywords in the target dictionary tree.
3. The method of claim 1, wherein the generating the target word stock based on the at least one target word comprises:
for each target word in the at least one target word, determining a risk category corresponding to the target word;
and generating the target word stock based on the at least one target word and the risk category corresponding to each target word.
4. The method of claim 1, wherein the expanding the initial risk word to obtain an expanded risk word comprises:
determining editing distance and/or semantic similarity between each candidate expansion word and the initial risk word in a preset candidate expansion word library;
and determining the expansion risk words in the candidate expansion words based on the editing distance and/or the semantic similarity.
5. The method of claim 4, wherein the predetermined candidate expansion word stock is determined by:
acquiring historical risk information;
and determining the preset candidate expanded word library based on the occurrence frequency of each risk word in the historical risk information.
6. The method of claim 1, wherein the method further comprises:
acquiring a risk positive sample set and/or a risk negative sample set;
and updating the target word stock based on the risk positive sample set and/or the risk negative sample set.
7. The method of any one of claims 1 to 6, wherein the method further comprises:
and storing the snapshot of the target word stock.
8. A risk detection method, wherein the method comprises:
generating a target word stock based on the method of any one of claims 1-7;
and performing risk detection on the target object based on the target word stock.
9. An apparatus for generating a thesaurus, comprising:
a risk word acquisition unit configured to acquire an initial risk word;
the risk word expansion unit is configured to expand the initial risk word to obtain an expanded risk word;
an information determination unit configured to determine keyword information based on the initial risk word and the expanded risk word; the keyword information comprises a plurality of keywords and word segmentation marks, wherein the word segmentation marks are used for dividing each keyword in the keyword information into corresponding phrases;
a thesaurus generation unit configured to generate a target thesaurus based on each keyword in the keyword information;
wherein the thesaurus generation unit is further configured to:
determining a keyword set based on each keyword in the keyword information and a preset keyword sequence;
determining the word segmentation position from the position of the word segmentation mark in the keyword set;
dividing the keyword set into at least one target word based on the word segmentation position;
and generating the target word stock based on the at least one target word.
10. The apparatus of claim 9, wherein the thesaurus generation unit is further configured to:
traversing each keyword in the keyword information in an initial dictionary tree according to the preset keyword sequence;
for each keyword which is not stored in advance, storing the keyword and the next keyword of the keyword in the initial dictionary tree in an associated mode according to the preset keyword sequence to obtain a target dictionary tree;
the set of keywords is determined based on the individual keywords in the target dictionary tree.
11. The apparatus of claim 9, wherein the thesaurus generation unit is further configured to:
for each target word in the at least one target word, determining a risk category corresponding to the target word;
and generating the target word stock based on the at least one target word and the risk category corresponding to each target word.
12. The apparatus of claim 9, wherein the risk word extension unit is further configured to:
determining editing distance and/or semantic similarity between each candidate expansion word and the initial risk word in a preset candidate expansion word library;
and determining the expansion risk words in the candidate expansion words based on the editing distance and/or the semantic similarity.
13. The apparatus of claim 12, wherein the risk word extension unit is further configured to:
acquiring historical risk information;
and determining the preset candidate expanded word library based on the occurrence frequency of each risk word in the historical risk information.
14. The apparatus of claim 9, wherein the apparatus further comprises:
a set acquisition unit configured to acquire a risk positive sample set and/or a risk negative sample set;
and the word stock updating unit is configured to update the target word stock based on the risk positive sample set and/or the risk negative sample set.
15. The apparatus according to any one of claims 9 to 14, wherein the apparatus further comprises:
and a snapshot storage unit configured to store a snapshot of the target word stock.
16. A risk detection device, wherein the device comprises a device for generating a thesaurus as claimed in any one of the claims 9-15 and a risk detection unit;
the risk detection unit is configured to perform risk detection on a target object based on a target word stock generated by the means for generating word stock.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-8.
CN202110437047.4A 2021-04-22 2021-04-22 Method and device for generating word stock Active CN113128209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110437047.4A CN113128209B (en) 2021-04-22 2021-04-22 Method and device for generating word stock

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110437047.4A CN113128209B (en) 2021-04-22 2021-04-22 Method and device for generating word stock

Publications (2)

Publication Number Publication Date
CN113128209A (en) 2021-07-16
CN113128209B (en) 2023-11-24

Family

ID=76779527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110437047.4A Active CN113128209B (en) 2021-04-22 2021-04-22 Method and device for generating word stock

Country Status (1)

Country Link
CN (1) CN113128209B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114238619B (en) * 2022-02-23 2022-04-29 成都数联云算科技有限公司 Method, system, device and medium for screening Chinese nouns based on edit distance
CN114661974B (en) * 2022-03-21 2024-03-08 重庆市规划和自然资源信息中心 Government website public opinion analysis and early warning method by utilizing natural language semantic analysis
CN114661934B (en) * 2022-03-21 2024-03-01 重庆市规划和自然资源信息中心 Method for multidimensional monitoring of government new media public opinion early warning based on data mining analysis technology

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130033694A (en) * 2011-09-27 2013-04-04 엔에이치엔비즈니스플랫폼 주식회사 Method, apparatus and computer readable recording medium for generating exetension data-set of concept keywords
CN104268176A (en) * 2012-06-26 2015-01-07 北京奇虎科技有限公司 Recommendation method and system based on search keyword
CN105653660A (en) * 2015-12-29 2016-06-08 云南电网有限责任公司电力科学研究院 Association method and device of retrieval keyword
CN108170664A (en) * 2017-11-29 2018-06-15 有米科技股份有限公司 Keyword expanding method and device based on emphasis keyword
CN108197098A (en) * 2017-11-22 2018-06-22 阿里巴巴集团控股有限公司 A kind of generation of keyword combined strategy and keyword expansion method, apparatus and equipment
CN108984596A (en) * 2018-06-01 2018-12-11 阿里巴巴集团控股有限公司 A kind of keyword excavates and the method, device and equipment of risk feedback
CN110287493A (en) * 2019-06-28 2019-09-27 中国科学技术信息研究所 Risk phrase chunking method, apparatus, electronic equipment and storage medium
CN110909539A (en) * 2019-10-15 2020-03-24 平安科技(深圳)有限公司 Word generation method, system, computer device and storage medium of corpus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104516902A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Semantic information acquisition method and corresponding keyword extension method and search method

Also Published As

Publication number Publication date
CN113128209A (en) 2021-07-16

Similar Documents

Publication Publication Date Title
CN113128209B (en) Method and device for generating word stock
CN113094559B (en) Information matching method, device, electronic equipment and storage medium
CN113836314B (en) Knowledge graph construction method, device, equipment and storage medium
CN112988753B (en) Data searching method and device
CN113792154A (en) Method and device for determining fault association relationship, electronic equipment and storage medium
CN112818686A (en) Domain phrase mining method and device and electronic equipment
CN113988157A (en) Semantic retrieval network training method and device, electronic equipment and storage medium
CN113408660A (en) Book clustering method, device, equipment and storage medium
US20220198358A1 (en) Method for generating user interest profile, electronic device and storage medium
CN112925883A (en) Search request processing method and device, electronic equipment and readable storage medium
CN113660541A (en) News video abstract generation method and device
CN112699237B (en) Label determination method, device and storage medium
CN113408280A (en) Negative example construction method, device, equipment and storage medium
CN113963197A (en) Image recognition method and device, electronic equipment and readable storage medium
CN114818736B (en) Text processing method, chain finger method and device for short text and storage medium
CN116166814A (en) Event detection method, device, equipment and storage medium
CN113239149B (en) Entity processing method, device, electronic equipment and storage medium
CN112926297B (en) Method, apparatus, device and storage medium for processing information
CN114048315A (en) Method and device for determining document tag, electronic equipment and storage medium
KR20220024251A (en) Method and apparatus for building event library, electronic device, and computer-readable medium
US11379669B2 (en) Identifying ambiguity in semantic resources
CN114492364A (en) Same vulnerability judgment method, device, equipment and storage medium
CN112818167B (en) Entity retrieval method, entity retrieval device, electronic equipment and computer readable storage medium
US20230095947A1 (en) Method and apparatus for pushing resource, and storage medium
CN114861062B (en) Information filtering method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant