CN110032734B - Synonym expansion method and apparatus, and generative adversarial network model training method and apparatus - Google Patents


Info

Publication number
CN110032734B
CN110032734B (application CN201910204138.6A)
Authority
CN
China
Prior art keywords
word
words
keywords
meaning
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910204138.6A
Other languages
Chinese (zh)
Other versions
CN110032734A (en
Inventor
刘焱
吕中厚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910204138.6A priority Critical patent/CN110032734B/en
Publication of CN110032734A publication Critical patent/CN110032734A/en
Application granted granted Critical
Publication of CN110032734B publication Critical patent/CN110032734B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a synonym expansion method and apparatus and a training method and apparatus for a generative adversarial network (GAN) model. The synonym expansion method comprises the following steps: acquiring a keyword to be processed; finding synonyms of the keyword in a generated candidate word set by using a word vector tool; and generating synonyms of the keyword and of each found synonym, respectively, by using a pre-trained GAN model. By applying the scheme of the invention, processing efficiency can be improved.

Description

Synonym expansion method and apparatus, and generative adversarial network model training method and apparatus
[ Technical Field ]
The present invention relates to computer application technology, and in particular to a synonym expansion method and apparatus and a training method and apparatus for a generative adversarial network model.
[ background of the invention ]
Currently, a large number of Internet applications allow users to post, reply, and comment; such content may be collectively referred to as User Generated Content (UGC).
Applications usually impose strict rules on UGC, and UGC therefore needs to be reviewed.
A common review approach is keyword-based filtering, where the richness of the keyword list directly determines the filtering effect. At present, keywords are usually compiled manually, but this approach can hardly cover all cases and is easy to bypass.
To avoid being bypassed, the manually compiled keywords need to be expanded so that as many of their synonyms as possible are covered; however, the prior art mainly relies on manual mining, which is inefficient.
[ summary of the invention ]
In view of the above, the present invention provides a synonym expansion method and apparatus and a training method and apparatus for a generative adversarial network model.
The specific technical scheme is as follows:
a method of synonym expansion, comprising:
acquiring keywords to be processed;
finding synonyms of the keyword in a generated candidate word set by using a word vector tool;
and generating synonyms of the keyword and of each found synonym, respectively, by using a pre-trained generative adversarial network (GAN) model.
According to a preferred embodiment of the present invention, finding synonyms of the keyword in the generated candidate word set by using the word vector tool comprises:
inputting the keyword into the word vector tool, the word vector tool calculating the distance between the word vector representation of each candidate word and that of the keyword, selecting and returning the N candidate words closest to the keyword, and taking the returned candidate words as synonyms of the keyword, where N is a positive integer.
According to a preferred embodiment of the present invention, the method for generating the candidate word set includes:
collecting user generated content UGC data;
and performing word segmentation on the UGC data, taking the word segmentation results as candidate words.
According to a preferred embodiment of the present invention, generating synonyms of the keyword and of each found synonym by using the pre-trained GAN model comprises:
for each word among the keyword and the found synonyms, inputting the word and noise into the GAN model to obtain a synonym of the word generated by the GAN model.
According to a preferred embodiment of the invention, the method further comprises: for a same word, inputting different noise into the GAN model to obtain different synonyms of the word generated by the GAN model.
A training method for a generative adversarial network (GAN) model, comprising the following steps:
obtaining training samples, each training sample comprising: an original word and a near-meaning word of the original word;
training the GAN model according to the training samples, so that during synonym expansion, for a keyword to be processed, after synonyms of the keyword are found in a generated candidate word set by using a word vector tool, synonyms of the keyword and of each found synonym are generated, respectively, by using the GAN model.
According to a preferred embodiment of the present invention, the synonym of the original word is a variant word of the original word, obtained by one or a combination of the following: removing part of the content of the original word, and replacing part or all of the content of the original word;
for each word among the keyword and the found synonyms, the synonym of the word generated by the GAN model is a variant word of the word, obtained by one or a combination of the following: removing part of the content of the word, and replacing part or all of the content of the word.
According to a preferred embodiment of the present invention, replacing part or all of the content of the word comprises one or any combination of the following: replacing at least one character of the word with its pinyin, replacing at least one character of the word with its pinyin initial, and replacing at least one character of the word with another character of similar pronunciation.
A synonym expansion apparatus, comprising: a first expansion unit and a second expansion unit;
the first expansion unit is configured to acquire a keyword to be processed and to find synonyms of the keyword in a generated candidate word set by using a word vector tool;
the second expansion unit is configured to generate synonyms of the keyword and of each found synonym, respectively, by using a pre-trained generative adversarial network (GAN) model.
According to a preferred embodiment of the present invention, the first expansion unit inputs the keyword into the word vector tool; the word vector tool calculates the distance between the word vector representation of each candidate word and that of the keyword, and selects and returns the N candidate words closest to the keyword; the returned candidate words are taken as synonyms of the keyword, where N is a positive integer.
According to a preferred embodiment of the present invention, the first expansion unit is further configured to collect user-generated content (UGC) data, perform word segmentation on the UGC data, and take the word segmentation results as candidate words.
According to a preferred embodiment of the present invention, the second expansion unit inputs, for each word among the keyword and the found synonyms, the word and noise into the GAN model to obtain a synonym of the word generated by the GAN model.
According to a preferred embodiment of the present invention, the second expansion unit is further configured to input, for a same word, different noise into the GAN model to obtain different synonyms of the word generated by the GAN model.
A generative adversarial network (GAN) model training apparatus, comprising: a sample acquisition unit and a model training unit;
the sample acquisition unit is configured to acquire training samples, each training sample comprising an original word and a synonym of the original word;
the model training unit is configured to train the GAN model according to the training samples, so that during synonym expansion, for a keyword to be processed, after synonyms of the keyword are found in a generated candidate word set by using a word vector tool, synonyms of the keyword and of each found synonym are generated, respectively, by using the GAN model.
According to a preferred embodiment of the present invention, the synonym of the original word is a variant word of the original word, obtained by one or a combination of the following: removing part of the content of the original word, and replacing part or all of the content of the original word;
for each word among the keyword and the found synonyms, the synonym of the word generated by the GAN model is a variant word of the word, obtained by one or a combination of the following: removing part of the content of the word, and replacing part or all of the content of the word.
According to a preferred embodiment of the present invention, replacing part or all of the content of the word comprises one or any combination of the following: replacing at least one character of the word with its pinyin, replacing at least one character of the word with its pinyin initial, and replacing at least one character of the word with another character of similar pronunciation.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method as set forth above.
As described above, with the scheme of the present invention, synonyms of an acquired keyword to be processed can be found in a generated candidate word set by using a word vector tool, and further synonyms of the keyword and of each found synonym can then be generated by using a pre-trained GAN model, so that a plurality of synonyms of the keyword are expanded automatically and processing efficiency is improved.
[ description of the drawings ]
FIG. 1 is a flowchart of a method for expanding a synonym according to an embodiment of the present invention.
Fig. 2 is a flowchart of an embodiment of a GAN model training method according to the present invention.
Fig. 3 is a schematic structural diagram of a synonym expansion apparatus according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of the GAN model training apparatus according to an embodiment of the present invention.
FIG. 5 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention.
[ Detailed Description ]
In order to make the technical solution of the present invention clearer and more obvious, the solution of the present invention is further described below by referring to the drawings and examples.
It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" herein generally indicates an "or" relationship between the preceding and following objects.
FIG. 1 is a flowchart of a method for expanding a synonym according to an embodiment of the present invention. As shown in fig. 1, the following detailed implementation is included.
In 101, a keyword to be processed is obtained.
At 102, synonyms of the keyword to be processed are found in a generated candidate word set by using a word vector tool.
In 103, synonyms of the keyword to be processed and of each found synonym are generated, respectively, by using a pre-trained generative adversarial network (GAN) model.
The keyword to be processed may be a manually compiled keyword. Each manually compiled keyword can be treated in turn as a keyword to be processed in the manner shown in Fig. 1.
For a keyword to be processed, a word vector tool may first be used to find synonyms of the keyword in the generated candidate word set. For example, the keyword may be input into the word vector tool; the tool calculates the distance between the word vector representation of each candidate word and that of the keyword, then selects and returns the N candidate words closest to the keyword; the returned candidate words are taken as synonyms of the keyword. N is a positive integer whose specific value may be determined according to actual needs.
To this end, UGC data may be collected in advance, word segmentation may be performed on the collected data, and the word segmentation results may be used as candidate words. For example, typical UGC sources such as Baidu Tieba, Baijiahao, and the Chinese Wikipedia may be collected and segmented with a conventional word segmentation method. The resulting segments may be used directly as candidate words, or they may first be screened to remove segments that do not meet predetermined requirements, with the remaining segments used as candidate words.
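The candidate-word construction step can be sketched as follows. This is a minimal illustration only: the tiny `VOCAB` lexicon and the greedy forward maximum-matching segmenter stand in for a production Chinese tokenizer, and are not the patent's method.

```python
# Sketch: build the candidate word set by segmenting collected UGC text.
# VOCAB and the greedy matcher below are hypothetical stand-ins.
VOCAB = {"微信", "微信号", "广告", "添加", "好友"}  # illustrative lexicon
MAX_WORD_LEN = max(len(w) for w in VOCAB)

def segment(text: str) -> list:
    """Greedy forward maximum matching: prefer the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_WORD_LEN), i, -1):
            if text[i:j] in VOCAB or j == i + 1:  # fall back to single char
                words.append(text[i:j])
                i = j
                break
    return words

def build_candidates(ugc_corpus: list) -> set:
    """Candidate words = union of segmentation results, single chars dropped."""
    candidates = set()
    for doc in ugc_corpus:
        candidates.update(w for w in segment(doc) if len(w) > 1)
    return candidates

candidates = build_candidates(["添加微信号好友", "微信广告"])
```

Dropping single-character segments is one example of the "screening to remove segments that do not meet predetermined requirements" mentioned above.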
In addition, word vectors may be trained on the word segmentation results so that a word vector representation is obtained for each candidate word. The dimensionality of the representation may be determined according to actual needs, for example 200 or 300 dimensions; training word vectors is itself prior art. Word vector representations capture similarity and analogy relationships between different words well.
In this embodiment, for a keyword to be processed, a word vector tool may be used to find synonyms of the keyword in the generated candidate word set. Common word vector tools include Word2vec, GloVe, fastText, Gensim, Indra, and Deeplearning4j. After the keyword to be processed is input into the word vector tool, the tool calculates the distance between the word vector representation of each candidate word and that of the keyword, for example the Euclidean distance, and returns the top-N candidate words closest to the keyword as the found synonyms.
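The top-N nearest-neighbor lookup can be sketched as below. The 4-dimensional toy vectors are hypothetical (real embeddings are typically 200- or 300-dimensional, as noted above), and a real system would query the word vector tool directly rather than hand-built vectors.

```python
# Sketch: select the top-N candidate words with the smallest Euclidean
# distance to the keyword in word-vector space.
import numpy as np

vectors = {  # hypothetical toy embeddings
    "微信":   np.array([0.9, 0.1, 0.0, 0.2]),
    "weixin": np.array([0.8, 0.2, 0.1, 0.2]),
    "wx":     np.array([0.7, 0.3, 0.1, 0.1]),
    "广告":   np.array([0.0, 0.9, 0.8, 0.1]),
}

def nearest(keyword: str, n: int) -> list:
    """Return the n candidates with the smallest Euclidean distance."""
    kv = vectors[keyword]
    scored = [(w, float(np.linalg.norm(v - kv)))
              for w, v in vectors.items() if w != keyword]
    scored.sort(key=lambda t: t[1])
    return [w for w, _ in scored[:n]]

synonyms = nearest("微信", 2)
```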
The found synonyms of the keyword may be screened further, for example manually or by rules, to remove problematic words, such as words that are clearly not synonyms of the keyword.
Then, synonyms of the keyword to be processed and of each found synonym are generated, respectively, by using a pre-trained GAN model.
A GAN is a deep learning model and one of the most promising approaches to unsupervised learning on complex distributions in recent years. The model produces good output through adversarial learning between two modules in a framework: a generative module (Generative Model, G) and a discriminative module (Discriminative Model, D). The original GAN theory does not require G and D to be neural networks, only that they can fit the corresponding generation and discrimination functions; in practical applications, however, deep neural networks are generally used for both.
GAN draws on the two-player zero-sum game of game theory, with the two players being G and D. Suppose G is a network that generates pictures: it receives random noise z and generates a picture denoted G(z). D is a discriminative network that judges whether a picture is "real": its input is a picture x, and its output D(x) is the probability that x is a real picture, where 1 means certainly real and 0 means certainly fake. During training, G aims to generate pictures realistic enough to fool D, while D aims to distinguish generated pictures from real ones, so G and D form a dynamic game. Ideally, the game ends with G generating pictures G(z) that pass for real, and D unable to tell whether a picture produced by G is real, i.e. D(G(z)) = 0.5. The training purpose is then achieved, and G can be used to generate pictures.
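The dynamic game just described corresponds, in the standard GAN formulation, to the minimax objective

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

where D maximizes the value by correctly classifying real and generated samples, and G minimizes it by fooling D. At the equilibrium, the generator's distribution matches the data distribution and the optimal discriminator outputs 1/2 everywhere, matching D(G(z)) = 0.5 in the text above.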
Specifically, in this embodiment, training samples may be obtained, each comprising an original word and a synonym of the original word, and the GAN model may be trained on these samples; the training process teaches the GAN model how to derive synonyms of an original word from the original word.
Synonyms of the keyword to be processed and of each found synonym are then generated by the trained GAN model. For example, for each word among the keyword to be processed and the found synonyms, the word and noise may be input into the GAN model to obtain a synonym of the word generated by the model. For the same word, inputting different noise into the GAN model yields different generated synonyms.
After GAN training is complete, only G is needed; D is no longer used. G generates synonyms of an input word from the word and noise. The noise may be Gaussian, and for a given word, different noise produces different synonyms. How many synonyms to generate for each word may be determined according to actual needs.
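Inference with the trained generator can be sketched as follows. The "generator" here is a hypothetical stand-in that deterministically indexes a fixed variant table by the noise, purely to show the interface word + noise → variant, with different noise yielding different variants; a real G would be a trained neural network, and the variant table is illustrative.

```python
# Sketch of inference with a trained generator G: each (word, noise) pair
# yields one variant word; different noise vectors yield different variants.
import numpy as np

VARIANTS = {"微信": ["weixin", "wx", "微xin"]}  # illustrative variant table

def generate_synonym(word: str, noise: np.ndarray) -> str:
    """Toy G(word, z): map the noise vector to one of the word's variants."""
    idx = int(np.abs(noise).sum()) % len(VARIANTS[word])
    return VARIANTS[word][idx]

# Three different noise vectors produce three different variants.
v1 = generate_synonym("微信", np.zeros(4))
v2 = generate_synonym("微信", np.full(4, 0.5))
v3 = generate_synonym("微信", np.full(4, 0.25))
```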
The synonym of the original word may be a variant word of the original word, which may include but is not limited to one or a combination of the following: removing part of the content of the original word, and replacing part or all of the content of the original word. Correspondingly, for each word among the keyword to be processed and the found synonyms, the synonym generated by the GAN model is likewise a variant word of that word, obtained in the same ways.
Replacing part or all of the content of a word may include, but is not limited to, one or any combination of the following: replacing at least one character of the word with its pinyin, replacing at least one character with its pinyin initial, and replacing at least one character with another character of similar pronunciation.
In existing keyword-based filtering, suppose WeChat ID advertisements in UGC content are to be filtered, with "微信号" (WeChat ID) as the keyword. The keyword is then easily bypassed by variants such as "weixin", "wx", "微信", and "微xin": "weixin" and "微xin" replace at least one character with its pinyin, "wx" replaces characters with pinyin initials, and "微信" removes part of the content of "微信号". Likewise, suppose political figures' names in UGC content are to be filtered, with the names as keywords; the filter can then be bypassed by replacing at least one character of a name with another character of similar pronunciation.
In the scheme of this embodiment, the GAN model learns through training how to derive variant words from original words, so that it can generate variant words of input words when later used for synonym expansion.
As described above, for any keyword to be processed, its synonyms may be expanded through two successive stages in this embodiment. First, relatively common synonyms are expanded with a word vector tool; then, on that basis, the GAN model further generates synonyms (e.g., variant words) of the keyword and of each found synonym. Suppose the word vector tool expands 3 synonyms for a keyword; together with the keyword itself this gives 4 words, and if the GAN model further expands each of the 4 words into 3 more words, 16 words are obtained in total. A large number of synonyms is thus expanded, greatly improving both the capability and the efficiency of synonym expansion.
Subsequently, the expanded synonyms, together with the manually compiled keywords, may be used as review keywords to review and filter the UGC content uploaded by users.
Fig. 2 is a flowchart of an embodiment of a GAN model training method according to the present invention. As shown in fig. 2, the following detailed implementation is included.
In 201, training samples are obtained, each comprising an original word and a synonym of the original word.
At 202, a GAN model is trained on the training samples, so that during synonym expansion, for a keyword to be processed, after synonyms of the keyword are found in the generated candidate word set with a word vector tool, the GAN model is used to generate synonyms of the keyword and of each found synonym.
The synonym of the original word may be a variant word of the original word, which may include but is not limited to one or a combination of the following: removing part of the content of the original word, and replacing part or all of the content of the original word. Correspondingly, for each word among the keyword to be processed and the found synonyms, the synonym of the word generated by the GAN model may be a variant word of that word, obtained in the same ways.
Replacing part or all of the content of a word may include, but is not limited to, one or any combination of the following: replacing at least one character of the word with its pinyin, replacing at least one character with its pinyin initial, and replacing at least one character with another character of similar pronunciation.
For example, if the original word is "微信" ("Weixin"), then "weixin", "wx", "Wenxin" (a similar-sounding substitute), or "微xin" may serve as variant words of the original word. Training the GAN model on such samples teaches it how to derive variant words from original words, so that variant words of input words can be generated during subsequent synonym expansion.
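The training-sample construction described at 201 can be sketched as below; the pairing of an original word with each of its variants follows the text, while the table contents are illustrative.

```python
# Sketch: flatten a {original word: [variant words]} table into
# (original word, variant) training pairs for the GAN model.
def build_training_samples(variant_table: dict) -> list:
    """Each sample pairs an original word with one of its variant words."""
    return [(orig, var)
            for orig, variants in variant_table.items()
            for var in variants]

samples = build_training_samples({"微信": ["weixin", "wx", "微xin"]})
```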
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules involved are not necessarily all required by the invention.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In short, by adopting the scheme of the above method embodiments, a plurality of synonyms of a keyword can be expanded automatically, thereby improving processing efficiency.
The above is a description of method embodiments, and the embodiments of the present invention are further described below by way of apparatus embodiments.
Fig. 3 is a schematic structural diagram of a synonym expansion apparatus according to an embodiment of the present invention. As shown in Fig. 3, it comprises a first expansion unit 301 and a second expansion unit 302.
The first expansion unit 301 is configured to acquire a keyword to be processed and to find synonyms of the keyword in the generated candidate word set by using a word vector tool.
The second expansion unit 302 is configured to generate synonyms of the keyword and of each found synonym, respectively, by using a pre-trained GAN model.
The keyword to be processed may be a manually compiled keyword; each manually compiled keyword can be treated in turn as a keyword to be processed in the above manner.
For a keyword to be processed, the first expansion unit 301 may first find synonyms of the keyword in the generated candidate word set by using a word vector tool. For example, the keyword may be input into the word vector tool; the tool calculates the distance between the word vector representation of each candidate word and that of the keyword, then selects and returns the N candidate words closest to the keyword; the returned candidate words are taken as synonyms of the keyword, where N is a positive integer.
To this end, the first expansion unit 301 may collect UGC data in advance and perform word segmentation on it, taking the segmentation results as candidate words. For example, typical UGC sources such as Baidu Tieba, Baijiahao, and the Chinese Wikipedia may be collected and segmented with an existing word segmentation method, and the resulting segments used as candidate words.
Further, the second expansion unit 302 may generate synonyms of the keyword and of each found synonym by using the GAN model. For example, for each word among the keyword and the found synonyms, the second expansion unit 302 may input the word and noise into the GAN model to obtain a synonym of the word generated by the model; inputting different noise for the same word yields different synonyms generated by the GAN model.
Fig. 4 is a schematic structural diagram of a GAN model training apparatus according to an embodiment of the present invention. As shown in Fig. 4, it comprises a sample acquisition unit 401 and a model training unit 402.
A sample acquiring unit 401, configured to acquire training samples, where each training sample may include: an original word and a near-meaning word of the original word.
A model training unit 402, configured to train a GAN model according to the training samples, so that during near-meaning word expansion, for a keyword to be processed, after a word vector tool is used to find near-meaning words of the keyword from the generated candidate word set, the GAN model is used to generate near-meaning words of the keyword and of the found near-meaning words respectively.
The near-meaning word of the original word may be a variant of the original word, which may include but is not limited to one or a combination of the following: removing part of the content of the original word, replacing part or all of the content of the original word, and the like. Accordingly, for each word among the keyword and the found near-meaning words, the near-meaning word of the word generated by the GAN model may be a variant of the word, which may likewise include but is not limited to one or a combination of the following: removing part of the content of the word, replacing part or all of the content of the word, and the like.
Replacing part or all of the content of a word may include, but is not limited to, one or any combination of the following: replacing at least one character in the word with its pinyin, replacing at least one character in the word with its pinyin initial, replacing at least one character in the word with another character of similar pronunciation, and the like.
For example, if the original word is "Weixin", then "weixin", "wx", "Wenxin", or "micro xin" may serve as variants of the original word. The GAN model is trained on such training samples; that is, the GAN model learns how to derive a variant from an original word, so that when near-meaning words are later expanded with the GAN model, a variant of any input word can be generated.
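How such (original word, variant) training pairs might be assembled can be sketched as follows; the pinyin table and the particular edits are hypothetical illustrations of the variant types listed above, not the patent's actual data pipeline:

```python
# Hypothetical character-to-pinyin table; a real system would use
# a full pinyin dictionary.
PINYIN = {"微": "wei", "信": "xin"}

def variants(word):
    """Build illustrative variants of a word: full-pinyin replacement,
    pinyin-initial replacement, and removal of part of the content."""
    out = []
    out.append("".join(PINYIN.get(c, c) for c in word))     # replace characters with pinyin
    out.append("".join(PINYIN.get(c, c)[0] for c in word))  # replace characters with initials
    if len(word) > 1:
        out.append(word[:-1])                               # remove part of the content
    return out

# Training pairs of the form (original word, near-meaning variant).
pairs = [("微信", v) for v in variants("微信")]
print(pairs)  # → [('微信', 'weixin'), ('微信', 'wx'), ('微信', '微')]
```

Pairs like these could then serve as the training samples from which the GAN model learns the original-word-to-variant mapping.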
For the specific work flows of the apparatus embodiments shown in fig. 3 and fig. 4, reference may be made to the related descriptions in the foregoing method embodiments, which are not repeated here. In practical applications, the apparatuses shown in fig. 3 and fig. 4 may each be an independent apparatus, or may be combined into one apparatus.
FIG. 5 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention. The computer system/server 12 shown in FIG. 5 is only one example and should not be taken to limit the scope of use or functionality of embodiments of the present invention.
As shown in fig. 5, computer system/server 12 is in the form of a general purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 that connects the various system components, including the memory 28 and the processors 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5 and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with computer system/server 12, and/or any device (e.g., network card, modem, etc.) that enables computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the computer system/server 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 20. As shown in FIG. 5, network adapter 20 communicates with the other modules of computer system/server 12 via bus 18. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processor 16 executes various functional applications and data processing by executing programs stored in the memory 28, for example, implementing the methods in the embodiments shown in fig. 1 or fig. 2.
The invention also discloses a computer-readable storage medium on which a computer program is stored, which program, when being executed by a processor, will carry out the method of the embodiments shown in fig. 1 or fig. 2.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method, etc., can be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (14)

1. A method for expanding near-meaning words, comprising:
acquiring keywords to be processed;
finding out the similar meaning words of the keywords from the generated candidate word set by using a word vector tool;
generating near-meaning words of the keywords and of the found near-meaning words respectively by using a pre-trained generative adversarial network (GAN) model, comprising: for each word among the keywords and the found near-meaning words, inputting the word and noise into the GAN model respectively, to obtain a near-meaning word of the word generated by the GAN model;
the method further comprising: for a same word, inputting different noises into the GAN model respectively, to obtain different near-meaning words of the word generated by the GAN model.
2. The method of claim 1,
the finding out the near-meaning words of the keywords from the generated candidate word set by using a word vector tool comprises the following steps:
inputting the keyword into the word vector tool, obtaining from the word vector tool the distances between the word vector representation of each candidate word and the word vector representation of the keyword, then selecting and returning the N candidate words closest to the keyword, and taking the returned candidate words as the near-meaning words of the keyword, wherein N is a positive integer.
3. The method of claim 1,
the mode for generating the alternative word set comprises the following steps:
collecting user generated content UGC data;
and carrying out word segmentation on the UGC data, and taking word segmentation results as alternative words.
4. A training method for a generative adversarial network model, comprising:
obtaining training samples, each training sample comprising: an original word and a near-meaning word of the original word;
training a generative adversarial network (GAN) model according to the training samples, so that during near-meaning word expansion, for a keyword to be processed, after near-meaning words of the keyword are found from the generated candidate word set using a word vector tool, near-meaning words of the keyword and of the found near-meaning words are generated respectively using the GAN model.
5. The method of claim 4,
the near-meaning word of the original word is a variant of the original word, comprising one or a combination of the following: removing part of the content of the original word, and replacing part or all of the content of the original word;
for each word among the keyword and the found near-meaning words, the near-meaning word of the word generated by the GAN model is a variant of the word, comprising one or a combination of the following: removing part of the content of the word, and replacing part or all of the content of the word.
6. The method of claim 5,
the replacement of part or all of the content of the words comprises one or any combination of the following: replacing at least one character in the words with pinyin, replacing at least one character in the words with pinyin first letters, and replacing at least one character in the words with other characters with similar pronunciation.
7. A near-meaning word expansion apparatus, comprising: a first expansion unit and a second expansion unit;
the first expansion unit is used for acquiring keywords to be processed and finding out near-meaning words of the keywords from the generated alternative word set by using a word vector tool;
the second expansion unit is configured to generate near-meaning words of the keyword and of the found near-meaning words respectively by using a pre-trained generative adversarial network (GAN) model;
the second expansion unit inputs the word and noise into the GAN model respectively for each word in the keyword and the searched near-meaning words to obtain the near-meaning words of the word generated by the GAN model;
the second expansion unit is further configured to, for a same word, input different noises into the GAN model respectively, to obtain different near-meaning words of the word generated by the GAN model.
8. The apparatus of claim 7,
the first expansion unit inputs the keyword into the word vector tool, obtains from the word vector tool the distances between the word vector representation of each candidate word and the word vector representation of the keyword, and selects and returns the N candidate words closest to the keyword; the returned candidate words are taken as near-meaning words of the keyword, N being a positive integer.
9. The apparatus of claim 7,
the first expansion unit is further used for collecting user generated content UGC data, carrying out word segmentation on the UGC data, and taking word segmentation results as alternative words.
10. A training apparatus for a generative adversarial network model, comprising: a sample obtaining unit and a model training unit;
the sample obtaining unit is configured to obtain training samples, where each training sample includes: an original word and a near-meaning word of the original word;
the model training unit is configured to train a generative adversarial network (GAN) model according to the training samples, so that during near-meaning word expansion, for a keyword to be processed, after near-meaning words of the keyword are found from the generated candidate word set using a word vector tool, near-meaning words of the keyword and of the found near-meaning words are generated respectively using the GAN model.
11. The apparatus of claim 10,
the near-meaning word of the original word is a variant of the original word, comprising one or a combination of the following: removing part of the content of the original word, and replacing part or all of the content of the original word;
for each word in the keyword and the searched similar meaning words, the similar meaning words of the word generated by the GAN model are variant words of the word, and the variant words include one or a combination of the following: removing part of the content of the words, and replacing part or all of the content of the words.
12. The apparatus of claim 11,
the replacement of part or all of the content of the words comprises one or any combination of the following: replacing at least one character in the words with pinyin, replacing at least one character in the words with pinyin first letters, and replacing at least one character in the words with other characters with similar pronunciation.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 6 when executing the program.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.
CN201910204138.6A 2019-03-18 2019-03-18 Training method and device for similar meaning word expansion and generation of confrontation network model Active CN110032734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910204138.6A CN110032734B (en) 2019-03-18 2019-03-18 Training method and device for similar meaning word expansion and generation of confrontation network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910204138.6A CN110032734B (en) 2019-03-18 2019-03-18 Training method and device for similar meaning word expansion and generation of confrontation network model

Publications (2)

Publication Number Publication Date
CN110032734A CN110032734A (en) 2019-07-19
CN110032734B true CN110032734B (en) 2023-02-28

Family

ID=67236088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910204138.6A Active CN110032734B (en) 2019-03-18 2019-03-18 Training method and device for similar meaning word expansion and generation of confrontation network model

Country Status (1)

Country Link
CN (1) CN110032734B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177351A (en) * 2019-12-20 2020-05-19 北京淇瑀信息科技有限公司 Method, device and system for acquiring natural language expression intention based on rule
CN111291563B (en) * 2020-01-20 2023-09-01 腾讯科技(深圳)有限公司 Word vector alignment method and word vector alignment model training method
CN116010609B (en) * 2023-03-23 2023-06-09 山东中翰软件有限公司 Material data classifying method and device, electronic equipment and storage medium
CN116340470B (en) * 2023-05-30 2023-09-15 环球数科集团有限公司 Keyword associated retrieval system based on AIGC

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744956A (en) * 2014-01-06 2014-04-23 同济大学 Diversified expansion method of keyword
CN106294396A (en) * 2015-05-20 2017-01-04 北京大学 Keyword expansion method and keyword expansion system
CN107220232A (en) * 2017-04-06 2017-09-29 北京百度网讯科技有限公司 Keyword extracting method and device, equipment and computer-readable recording medium based on artificial intelligence
CN107862015A (en) * 2017-10-30 2018-03-30 北京奇艺世纪科技有限公司 A kind of crucial word association extended method and device
CN107957990A (en) * 2017-11-20 2018-04-24 东软集团股份有限公司 A kind of trigger word extended method, device and Event Distillation method and system
CN108255985A (en) * 2017-12-28 2018-07-06 东软集团股份有限公司 Data directory construction method, search method and device, medium and electronic equipment
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information
CN108733647A (en) * 2018-04-13 2018-11-02 中山大学 A kind of term vector generation method based on Gaussian Profile
CN109299342A (en) * 2018-11-30 2019-02-01 武汉大学 A kind of cross-module state search method based on circulation production confrontation network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10839285B2 (en) * 2017-04-10 2020-11-17 International Business Machines Corporation Local abbreviation expansion through context correlation

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744956A (en) * 2014-01-06 2014-04-23 同济大学 Diversified expansion method of keyword
CN106294396A (en) * 2015-05-20 2017-01-04 北京大学 Keyword expansion method and keyword expansion system
CN107220232A (en) * 2017-04-06 2017-09-29 北京百度网讯科技有限公司 Keyword extracting method and device, equipment and computer-readable recording medium based on artificial intelligence
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information
CN107862015A (en) * 2017-10-30 2018-03-30 北京奇艺世纪科技有限公司 A kind of crucial word association extended method and device
CN107957990A (en) * 2017-11-20 2018-04-24 东软集团股份有限公司 A kind of trigger word extended method, device and Event Distillation method and system
CN108255985A (en) * 2017-12-28 2018-07-06 东软集团股份有限公司 Data directory construction method, search method and device, medium and electronic equipment
CN108733647A (en) * 2018-04-13 2018-11-02 中山大学 A kind of term vector generation method based on Gaussian Profile
CN109299342A (en) * 2018-11-30 2019-02-01 武汉大学 A kind of cross-module state search method based on circulation production confrontation network

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Nasim Souly et al.; Semi Supervised Semantic Segmentation Using Generative Adversarial Network; 2017 IEEE International Conference on Computer Vision (ICCV); 2017 *
A lightweight threat-aware *** combining host and cloud analysis; Peng Guojun et al.; Journal of Huazhong University of Science and Technology (Natural Science Edition); 2016-03-23 (No. 03); pp. 22-26, 32 *
Research on a knowledge graph question answering system based on generative adversarial learning; Li Yifu; China Master's Theses Full-text Database (Information Science and Technology); 2019-01-15 (No. 1); I138-5571 *
A keyword extraction algorithm for short Chinese texts based on word and phrase length and frequency; Chen Weihe et al.; Computer Science; 2016-12-15 (No. 12); pp. 57-64 *
Generative adversarial networks: from generating data to creating intelligence; Wang Kunfeng et al.; Acta Automatica Sinica; 2018-05-15 (No. 05); pp. 4-9 *
Research progress on generative adversarial networks; Wang Wanliang et al.; Journal on Communications; 2018-02-25 (No. 02); pp. 139-152 *
A social and conceptual semantic expansion search method for microblog short texts; Cui Wanqiu et al.; Journal of Computer Research and Development; 2018-08-15 (No. 08); pp. 47-58 *

Also Published As

Publication number Publication date
CN110032734A (en) 2019-07-19

Similar Documents

Publication Publication Date Title
CN106940788B (en) Intelligent scoring method and device, computer equipment and computer readable medium
CN110032734B (en) Training method and device for similar meaning word expansion and generation of confrontation network model
CN106649818B (en) Application search intention identification method and device, application search method and server
CN107220232B (en) Keyword extraction method and device based on artificial intelligence, equipment and readable medium
CN107193973B (en) Method, device and equipment for identifying field of semantic analysis information and readable medium
US11372942B2 (en) Method, apparatus, computer device and storage medium for verifying community question answer data
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN110569335B (en) Triple verification method and device based on artificial intelligence and storage medium
US9633008B1 (en) Cognitive presentation advisor
CN111324771B (en) Video tag determination method and device, electronic equipment and storage medium
CN110597978B (en) Article abstract generation method, system, electronic equipment and readable storage medium
US20160117954A1 (en) System and method for automated teaching of languages based on frequency of syntactic models
CN110377750B (en) Comment generation method, comment generation device, comment generation model training device and storage medium
CN107861948B (en) Label extraction method, device, equipment and medium
CN107948730B (en) Method, device and equipment for generating video based on picture and storage medium
CN107844531B (en) Answer output method and device and computer equipment
CN114141384A (en) Method, apparatus and medium for retrieving medical data
CN116402166B (en) Training method and device of prediction model, electronic equipment and storage medium
EP4060526A1 (en) Text processing method and device
CN112559711A (en) Synonymous text prompting method and device and electronic equipment
CN114842982B (en) Knowledge expression method, device and system for medical information system
US11709872B2 (en) Computer-readable recording medium storing response processing program, response processing method, and information processing apparatus
CN110059180B (en) Article author identity recognition and evaluation model training method and device and storage medium
CN112183117B (en) Translation evaluation method and device, storage medium and electronic equipment
CN110378378B (en) Event retrieval method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant