CN111859917A - Topic model construction method and device and computer readable storage medium - Google Patents

Topic model construction method and device and computer readable storage medium

Info

Publication number
CN111859917A
CN111859917A (application CN202010752688.4A)
Authority
CN
China
Prior art keywords
topic
local
global
word
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010752688.4A
Other languages
Chinese (zh)
Inventor
Jiang Di (姜迪)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202010752688.4A priority Critical patent/CN111859917A/en
Publication of CN111859917A publication Critical patent/CN111859917A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a topic model construction method, a device, and a computer-readable storage medium. The method comprises the following steps: performing local topic model construction on the corpus of the local end to obtain the local topic word distribution corresponding to the corpus; and sending the local topic word distribution to a coordination end, so that the coordination end merges the local topic word distributions received from the data ends to obtain a global topic word distribution and obtains a global topic model according to the global topic word distribution. The invention improves the quality of the topic model while protecting the private information in the corpus of each data end.

Description

Topic model construction method and device and computer readable storage medium
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a topic model construction method, a device, and a computer-readable storage medium.
Background
A topic model is a statistical model used in machine learning and natural language processing to discover the abstract topics occurring in a collection of documents. Implicit semantic discovery tools, represented by topic models, have been widely used in a wide variety of applications. However, in practice, text information is often distributed across different machines, and because this information is private it cannot be gathered in one place for topic model construction. As a result, the amount of text available for building a topic model falls short of what is needed, and the quality of the constructed topic model is not high.
Disclosure of Invention
The invention mainly aims to provide a topic model construction method, a device, and a computer-readable storage medium, aiming to solve the problem that private text information is distributed across different machines and cannot be gathered for topic model construction, so that the amount of text available for building a topic model is insufficient and the quality of the topic model is low.
In order to achieve the above object, the present invention provides a method for constructing a topic model, wherein the method is applied to a data end, and the method comprises the following steps:
performing local topic model construction on a local corpus to obtain local topic word distribution corresponding to the local corpus;
and sending the local topic word distribution to a coordination end, so that the coordination end merges the local topic word distributions received from the data ends to obtain a global topic word distribution, and obtaining a global topic model according to the global topic word distribution.
Optionally, the step of constructing a local topic model for the corpus at the local end to obtain local topic word distribution corresponding to the corpus includes:
carrying out differential privacy processing on the corpus of the local end to obtain a processed corpus;
and performing local topic model construction on the processed corpus to obtain local topic word distribution corresponding to the processed corpus.
Optionally, the topic model includes topic word distribution, and the step of constructing the local topic model for the local corpus to obtain the local topic word distribution corresponding to the local corpus includes:
sampling according to a preset sampling algorithm to obtain the topic of each word of each document in the local corpus;
and obtaining the local topic word distribution corresponding to the corpus based on statistics of the topic of each word in each document.
Optionally, after the step of sending the local topic word distribution to a coordination end, so that the coordination end merges the local topic word distributions received from the data ends to obtain a global topic word distribution, and obtaining a global topic model according to the global topic word distribution, the method further includes:
receiving the global topic model sent by the coordination terminal;
sampling according to a preset sampling algorithm based on the global topic model to obtain the topic of each word in the document to be processed;
and obtaining the topic distribution of the document to be processed based on the topic of each word in the document to be processed.
In order to achieve the above object, the present invention provides a method for constructing a topic model, which is applied to a coordination end, and the method includes the following steps:
receiving the local topic word distribution sent by each data end, wherein each data end performs local topic model construction on its own corpus to obtain the local topic word distribution corresponding to the corpus;
and merging the local topic word distributions to obtain a global topic word distribution, and obtaining a global topic model according to the global topic word distribution.
Optionally, the step of merging the local topic word distributions to obtain a global topic word distribution includes:
merging the local topic word distributions to obtain a total topic word distribution;
calculating the similarity between every two topics according to the word distribution corresponding to each topic in the total topic word distribution, and obtaining the topic pairs whose similarity is greater than a preset threshold based on the calculation result;
and merging the word distributions corresponding to the two topics in each topic pair to update the total topic word distribution, and taking the updated total topic word distribution as the global topic word distribution.
Optionally, the step of obtaining a global topic model according to the global topic word distribution includes:
detecting whether the number of rounds of global topic model construction currently performed with the data ends has reached a preset number;
if it is detected that the number of rounds has not reached the preset number, sending the global topic word distribution to each data end, so that each data end performs local topic model construction on its own corpus based on the global topic word distribution to obtain the local topic word distribution corresponding to the corpus, for the next round of global topic model construction;
and if the number of rounds has reached the preset number, taking the global topic word distribution as the global topic model.
In order to achieve the above object, the present invention further provides a topic model construction device, including: a memory, a processor and a topic model builder stored on the memory and executable on the processor, the topic model builder when executed by the processor implementing the steps of the topic model building method as described above.
Further, to achieve the above object, the present invention also proposes a computer-readable storage medium having stored thereon a topic model construction program which, when executed by a processor, implements the steps of the topic model construction method as described above.
According to the method, the data end performs local topic model construction on the corpus of its local end to obtain the local topic word distribution corresponding to the corpus, and sends the local topic word distribution to the coordination end; the coordination end merges the local topic word distributions sent by the data ends to obtain the global topic word distribution, and then obtains the global topic model. Because the corpora of all data ends jointly contribute to the construction, more text information is available for building the model. Moreover, each data end does not send its corpus to the coordination end directly but sends only the topic word distribution, so the text information in the corpus is never directly exposed; thus the quality of the topic model is improved while the private information in each data end's corpus is protected.
Drawings
FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of the topic model construction method of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
It should be noted that the topic model building device in the embodiment of the present invention may be a smart phone, a personal computer, a server, or the like, which is not limited herein.
As shown in fig. 1, the topic model construction apparatus may include: a processor 1001 (e.g., a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration of the apparatus shown in FIG. 1 does not constitute a limitation of the topic model building apparatus, which may include more or fewer components than shown, combine some components, or arrange the components differently.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a topic model building program. The operating system is a program for managing and controlling hardware and software resources of the device, and supports the operation of the theme model building program and other software or programs.
When the topic model building device is a data side participating in global topic model building, in the device shown in fig. 1, the user interface 1003 is mainly used for data communication with a client side; the network interface 1004 is mainly used for establishing communication connection with a coordinating end participating in the global topic model construction; the processor 1001 may be configured to invoke the topic model builder stored in the memory 1005 and perform the following operations:
performing local topic model construction on a local corpus to obtain local topic word distribution corresponding to the local corpus;
and sending the local topic word distribution to a coordination end, so that the coordination end merges the local topic word distributions received from the data ends to obtain a global topic word distribution, and obtaining a global topic model according to the global topic word distribution.
Further, the step of constructing a local topic model for the corpus of the local end to obtain the local topic word distribution corresponding to the corpus includes:
carrying out differential privacy processing on the corpus of the local end to obtain a processed corpus;
and performing local topic model construction on the processed corpus to obtain local topic word distribution corresponding to the processed corpus.
Further, the topic model includes topic word distribution, and the step of constructing the local topic model for the local corpus to obtain the local topic word distribution corresponding to the local corpus includes:
sampling according to a preset sampling algorithm to obtain the topic of each word of each document in the local corpus;
and obtaining the local topic word distribution corresponding to the corpus based on statistics of the topic of each word in each document.
Further, after the step of sending the local topic word distribution to the coordination end, so that the coordination end merges the local topic word distributions received from the data ends to obtain a global topic word distribution, and obtaining a global topic model according to the global topic word distribution, the processor 1001 may be configured to invoke the topic model building program stored in the memory 1005 and further perform the following operations:
receiving the global topic model sent by the coordination terminal;
sampling according to a preset sampling algorithm based on the global topic model to obtain the topic of each word in the document to be processed;
and obtaining the topic distribution of the document to be processed based on the topic of each word in the document to be processed.
When the topic model building device is a coordination end participating in global topic model building, in the device shown in fig. 1, the user interface 1003 is mainly used for data communication with a client; the network interface 1004 is mainly used for establishing communication connection with a data end participating in global topic model construction; the processor 1001 may be configured to invoke the topic model builder stored in the memory 1005 and perform the following operations:
receiving the local topic word distribution sent by each data end, wherein each data end performs local topic model construction on its own corpus to obtain the local topic word distribution corresponding to the corpus;
and merging the local topic word distributions to obtain a global topic word distribution, and obtaining a global topic model according to the global topic word distribution.
Further, the step of merging the local topic word distributions to obtain a global topic word distribution includes:
merging the local topic word distributions to obtain a total topic word distribution;
calculating the similarity between every two topics according to the word distribution corresponding to each topic in the total topic word distribution, and obtaining the topic pairs whose similarity is greater than a preset threshold based on the calculation result;
and merging the word distributions corresponding to the two topics in each topic pair to update the total topic word distribution, and taking the updated total topic word distribution as the global topic word distribution.
Further, the step of obtaining a global topic model according to the global topic word distribution includes:
detecting whether the number of rounds of global topic model construction currently performed with the data ends has reached a preset number;
if it is detected that the number of rounds has not reached the preset number, sending the global topic word distribution to each data end, so that each data end performs local topic model construction on its own corpus based on the global topic word distribution to obtain the local topic word distribution corresponding to the corpus, for the next round of global topic model construction;
and if the number of rounds has reached the preset number, taking the global topic word distribution as the global topic model.
Based on the above structure, embodiments of the topic model construction method are provided.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the topic model construction method of the present invention. It should be noted that, although a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown or described herein.
In this embodiment, the topic model construction method is applied to a data end; the data end is in communication connection with a coordination end, and both the data end and the coordination end can be devices such as smart phones, personal computers, and servers. In this embodiment, the topic model construction method includes:
step S10, local topic model construction is carried out on the corpus of the local end, and local topic word distribution corresponding to the corpus is obtained;
in this embodiment, several data terminals may jointly perform topic model construction, each data terminal maintains one or more corpora, and each data terminal uses the corpus of its own home terminal to participate in the construction of the global topic model. A corpus may include a plurality of pieces of text data, which may be an article or a section of speech, etc. (hereinafter collectively referred to as "documents"). Specifically, each data terminal may establish a communication connection with the coordination terminal, and may be one of the data terminals to execute the function of the coordination terminal, or may be a third party serving as the coordination terminal.
Since the operations executed by the data terminals in the process of jointly performing the topic model construction are the same, the following description will take one data terminal as an example.
The data end can use the corpus of its local end to perform local topic model construction and obtain the local topic word distribution (also called topic-word distribution) corresponding to the corpus. Specifically, the data end may adopt an existing topic model construction method for its local corpus, for example one based on LDA (Latent Dirichlet Allocation) or on LSA (Latent Semantic Analysis). The local construction yields a topic model construction result for the corpus that includes at least the document topic distribution and the topic word distribution; the topic word distribution is extracted from this result and used as the local topic word distribution corresponding to the data end's corpus. Before the topic model is constructed, the number of topics to be divided, the document topic distribution (i.e., a document-topic probability matrix), and the topic word distribution (i.e., a topic-word probability matrix) are preset.
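As an illustration of this local step, the following sketch trains an LDA model on the local corpus with gensim and extracts the topic-word matrix that would be sent to the coordination end. The library choice and parameter values are assumptions for illustration; the patent only requires some existing topic model construction method.

```python
# Minimal sketch of the data end's local construction step.
from gensim import corpora, models

def build_local_topic_word_distribution(tokenized_docs, num_topics=100):
    """tokenized_docs: list of token lists, one per document."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(bow_corpus, num_topics=num_topics,
                          id2word=dictionary, passes=10)
    # get_topics() returns a (num_topics x vocabulary_size) probability matrix,
    # i.e. the local topic-word distribution sent to the coordination end.
    return lda.get_topics(), dictionary
```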
Further, the data end may preprocess each document in the corpus before constructing the topic model of the corpus. Specifically, operations such as low-frequency word removal, stop-word removal, punctuation removal, and word segmentation can be performed on the text of each document. Low-frequency words are words whose frequency of occurrence in the document is below a preset minimum frequency. Stop words are words that appear frequently in the document but carry no content, such as "of", "a", "thus", etc. Stop-word and punctuation removal can be carried out according to a preset stop-word list and symbol table. Word segmentation splits a text sequence into individual words and can be performed with a maximum matching method, a reverse maximum matching method, or a word-by-word traversal method, among others, which are not described in detail herein.
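A hedged sketch of this preprocessing pipeline is given below. The segmenter (jieba), the stop-word list, and the minimum frequency of 5 are assumptions for illustration, not requirements of the patent; any segmentation strategy named above would do.

```python
# Preprocessing sketch: segmentation, stop-word/punctuation removal,
# low-frequency-word filtering (all parameters are illustrative assumptions).
import string
from collections import Counter

import jieba  # one common Chinese word segmenter (assumed choice)

def preprocess(documents, stop_words, min_freq=5):
    tokenized = [[w for w in jieba.lcut(doc)
                  if w.strip()
                  and w not in stop_words
                  and w not in string.punctuation]  # a preset symbol table would extend this
                 for doc in documents]
    freq = Counter(w for doc in tokenized for w in doc)
    # drop low-frequency words: total occurrences below the preset minimum
    return [[w for w in doc if freq[w] >= min_freq] for doc in tokenized]
```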
Step S20, sending the local topic word distribution to a coordination end, so that the coordination end merges the local topic word distributions received from the data ends to obtain a global topic word distribution, and obtains a global topic model according to the global topic word distribution.
After obtaining the local topic word distribution, the data end sends it to the coordination end. The coordination end receives the local topic word distribution sent by each data end; that is, each data end sends the locally constructed topic word distribution to the coordination end. The coordination end merges the local topic word distributions to obtain the global topic word distribution. Specifically, the coordination end may directly splice and merge the local topic word distributions and use the merged result as the global topic word distribution. For example, with 2 data ends whose local topic word distributions each contain 100 topics and the word distribution corresponding to each topic, direct splicing yields a total topic word distribution containing 200 topics and their word distributions, which is used as the global topic word distribution. After obtaining the global topic word distribution, the coordination end can directly use it as the global topic model.
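The direct splice-merge can be pictured as stacking the per-end topic-word matrices, matching the 2 x 100 = 200 example above. This sketch assumes all data ends share an aligned vocabulary, a simplification made only for illustration.

```python
import numpy as np

def splice_merge(local_distributions):
    """local_distributions: list of (num_topics_i x vocab_size) arrays.
    Stacking two 100-topic matrices yields a 200-topic total distribution."""
    return np.vstack(local_distributions)
```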
The coordination end can also send the global topic word distribution back to each data end; each data end performs local topic model construction again on the basis of the global topic word distribution to obtain a new local topic word distribution and sends it to the coordination end; the coordination end merges the new local topic word distributions to obtain a new global topic word distribution and issues it again. When the coordination end detects that the number of rounds has reached the preset number, it takes the latest global topic word distribution as the global topic model.
In this embodiment, the data end performs local topic model construction on the corpus of its local end to obtain the local topic word distribution corresponding to the corpus and sends it to the coordination end; the coordination end merges the local topic word distributions sent by the data ends to obtain the global topic word distribution and then the global topic model. Because the corpora of all data ends jointly contribute to the construction, more text information is available for building the model. Moreover, each data end does not send its corpus to the coordination end directly but sends only the topic word distribution, so the text information in the corpus is never directly exposed; thus the quality of the topic model is improved while the private information in each data end's corpus is protected.
Further, based on the first embodiment, a second embodiment of the method for constructing a topic model of the present invention is provided, and in this embodiment, the step S10 includes:
step S101, performing differential privacy processing on a corpus at a local end to obtain a processed corpus;
further, in this embodiment, the data end may first perform the difference privacy processing on the corpus of the data end to obtain a processed corpus. Specifically, the differential privacy processing may be implemented by adding noise to each document in the corpus by using an existing differential privacy processing method. Differential privacy is a mathematical technique that can always calculate the degree of privacy enhancement while adding noise to the data, thus making the process of adding "noise" more rigorous.
Step S102, performing local topic model construction on the processed corpus to obtain the local topic word distribution corresponding to the processed corpus.
After the data end obtains the processed corpus, it constructs a topic model on the processed corpus. Specifically, the data end performs topic model construction on each noised document in the corpus; as before, an existing topic model construction method may be adopted, which is not described in detail herein. Because the data end constructs the topic model on a noised corpus, the resulting local topic word distribution is also noisy; after it is sent to the coordination end, the coordination end cannot infer the corpus content of any data end from its local topic word distribution, which further improves the security of the data ends' text information.
Further, in an embodiment, the step S10 includes:
step S103, sampling according to a preset sampling algorithm to obtain the theme of each word of each document in the corpus at the home terminal;
further, in this embodiment, the topic model includes topic word distribution, and constructing the topic model for the corpus is to obtain the topic word distribution of the corpus. Specifically, the data end may sample according to a preset sampling algorithm to obtain the topics of the words of each document in the corpus of the local end. The preset sampling algorithm may be a preset sampling algorithm, for example, a Gibbs sampling algorithm, or other sampling algorithms that may sample topics of respective words from respective documents. The data end samples the theme of each word according to the Gibbs sampling algorithm, and the specific steps can be as follows: the method comprises the steps that a corpus of a data end comprises a plurality of documents, each document comprises a plurality of words, the data end initially randomly endows each word of each document with a theme, and information of the words and the themes is counted to obtain the number of the words belonging to each theme in each document and the occurrence frequency of each word in each theme; calculating to obtain the topic probability of each word according to the two statistical results, wherein the specific calculation mode can refer to the construction mode of the LDA topic model, and detailed description is not repeated herein; according to the topic probability of each word, sampling a new topic number to endow the word; counting the information of the words and the topics again to obtain two statistical results, calculating the probability of the topics, and sampling new topics; and circularly iterating until convergence is detected, and obtaining the theme of each word of each document in the corpus. Wherein, a maximum number of times may be preset, and when the number of cycles reaches the maximum number of times, the data end determines convergence.
Step S104, obtaining the local topic word distribution corresponding to the corpus based on statistics of the topic of each word in each document.
The data end derives the topic-word distribution corresponding to the corpus from the topic statistics of each word in each document and uses it as the local topic word distribution. Specifically, the data end counts the words corresponding to each topic and the number of times each word occurs under the topic, computes the probability of each word from its occurrence count, and takes the words corresponding to each topic together with their probabilities as the topic-word distribution.
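Steps S103 and S104 together amount to a collapsed Gibbs sampler. The sketch below follows the procedure described above: random initial topics, iterative resampling from the count statistics, and a final read-off of the topic-word distribution. The hyperparameters alpha and beta and the fixed iteration budget (standing in for the preset maximum number of times) are assumed values.

```python
import numpy as np

def gibbs_topic_word(docs, K, V, iters=100, alpha=0.1, beta=0.01):
    """docs: list of word-id lists; K topics; V vocabulary size."""
    rng = np.random.default_rng(0)
    n_dk = np.zeros((len(docs), K))   # words per topic in each document
    n_kw = np.zeros((K, V))           # occurrences of each word under each topic
    n_k = np.zeros(K)                 # total words per topic
    z = [rng.integers(0, K, size=len(d)) for d in docs]  # random initial topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):            # preset maximum stands in for convergence
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]           # remove the word's current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())   # sample a new topic number
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    # step S104: topic-word probabilities from the final topic statistics
    phi = (n_kw + beta) / (n_k[:, None] + V * beta)
    return phi, z
```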
Further, in an embodiment, the data end may perform the preprocessing operations on each document in the corpus to obtain a preprocessed corpus, perform differential privacy processing on the preprocessed corpus, perform local topic model construction on the differentially-private corpus to obtain the local topic word distribution, and send it to the coordination end; the coordination end merges the local topic word distributions sent by the data ends to obtain the global topic word distribution, and obtains the global topic model from it.
Further, in an embodiment, after the step S20, the method further includes:
step S30, receiving the global topic model sent by the coordinating terminal;
step S40, sampling according to a preset sampling algorithm based on the global topic model to obtain the topic of each word in the document to be processed;
step S50, the topic distribution of the document to be processed is obtained based on the topic of each word in the document to be processed.
After the coordination end obtains the global topic model, it can send the global topic model to each data end. After receiving the global topic model, the data end can sample according to a preset sampling algorithm based on the global topic model to obtain the topic of each word in a document to be processed. The preset sampling algorithm may be the Gibbs sampling algorithm. Specifically, the global topic model comprises the global topic word distribution, which the data end keeps unchanged; it initially assigns a random topic to each word in the document to be processed and collects word-topic statistics, obtaining the number of words belonging to each topic in the document; from this statistic and the global topic word distribution it computes the topic probabilities of each word (again following the construction of the LDA topic model, not detailed here); it samples a new topic number for each word according to its topic probabilities, re-collects the statistics, recomputes the probabilities, and samples new topics; this iterates until convergence is detected, yielding the topic of each word in the document to be processed. A maximum number of iterations may be preset, and the data end deems the process converged when that maximum is reached. The data end then obtains the topic distribution of the document from the topic of each word in it: specifically, it can count the number of words assigned to each topic, compute the corresponding topic probability from that count, and take the probabilities of the topics as the topic distribution of the document to be processed.
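Inference on a new document differs from training only in that the global topic-word distribution stays fixed; only the document's own topic counts are resampled. A sketch under that reading, with assumed alpha and iteration budget:

```python
import numpy as np

def infer_doc_topics(doc, phi, iters=50, alpha=0.1):
    """doc: list of word ids; phi: the fixed global (K x V) topic-word matrix."""
    K = phi.shape[0]
    rng = np.random.default_rng(0)
    z = rng.integers(0, K, size=len(doc))            # random initial topics
    n_k = np.bincount(z, minlength=K).astype(float)  # words per topic in the doc
    for _ in range(iters):                           # preset maximum iterations
        for i, w in enumerate(doc):
            n_k[z[i]] -= 1                           # remove current assignment
            p = (n_k + alpha) * phi[:, w]            # phi is kept unchanged
            z[i] = rng.choice(K, p=p / p.sum())
            n_k[z[i]] += 1
    return n_k / n_k.sum()   # normalized per-topic counts = topic distribution
```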
In this embodiment, the global topic model is jointly constructed by the data ends from their respective corpora and therefore contains the implicit semantic information of the text on every data end, i.e., more implicit semantic information. When a data end uses the global topic model to mine the topic distribution of a document to be processed, the resulting topic distribution is more accurate.
Further, after the data end obtains the topic distribution of the document to be processed, it can use the topic distribution to classify the document. Specifically, the topic distribution of the document may be used as feature data and input into a preset text classification model to obtain the text category of the document. The text classification model may be set up and trained in advance according to the text classification task and may adopt, for example, an SVM (Support Vector Machine) model structure, which is not described in detail herein.
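A sketch of that downstream use, with sklearn's SVC standing in for the SVM text classification model; the kernel and the training data are placeholders, not prescribed by the patent.

```python
from sklearn.svm import SVC

def classify_by_topic_distribution(train_dists, train_labels, new_dists):
    """Topic distributions (one vector per document) serve as feature data."""
    clf = SVC(kernel="rbf")            # kernel choice is an assumption
    clf.fit(train_dists, train_labels)
    return clf.predict(new_dists)
```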
Further, based on the first and second embodiments, a third embodiment of the topic model construction method of the present invention is provided. In this embodiment, a topic model construction method is applied to a coordination side, and the topic model construction method includes:
Step A10, receiving the local topic word distribution sent by each data end, wherein each data end performs local topic model construction on its own corpus to obtain the local topic word distribution corresponding to the corpus;
in this embodiment, several data terminals may jointly perform topic model construction, each data terminal maintains one or more corpora, and each data terminal uses the corpus of its own home terminal to participate in the construction of the global topic model. A corpus may include a plurality of pieces of text data, which may be an article or a section of speech, etc. (hereinafter collectively referred to as "documents"). Specifically, each data terminal may establish a communication connection with the coordination terminal, and may be one of the data terminals to execute the function of the coordination terminal, or may be a third party serving as the coordination terminal.
Since the operations executed by the data terminals in the process of jointly performing the topic model construction are the same, the following description will take one data terminal as an example.
The coordination end can receive the local topic word distribution sent by each data end. Each data end can use the corpus of its local end to perform local topic model construction, obtain the local topic word distribution (also called topic-word distribution) corresponding to the corpus, and send it to the coordination end. Specifically, the data end may adopt an existing topic model construction method for its local corpus, for example one based on LDA (Latent Dirichlet Allocation) or on LSA (Latent Semantic Analysis). The local construction yields the local topic word distribution corresponding to the corpus. Before the topic model is constructed, the number of topics to be divided and the local topic word distribution, i.e., a topic-word probability matrix, are preset.
Step A20, merging the local topic word distributions to obtain a global topic word distribution, and obtaining a global topic model according to the global topic word distribution.
The coordination end merges the local topic word distributions to obtain the global topic word distribution. Specifically, the coordination end may directly splice and merge the local topic word distributions and use the merged result as the global topic word distribution. For example, with 2 data ends whose local topic word distributions each contain 100 topics and the word distribution corresponding to each topic, direct splicing yields a total topic word distribution containing 200 topics and their word distributions, which is used as the global topic word distribution. After obtaining the global topic word distribution, the coordination end can directly use it as the global topic model. The coordination end can also send the global topic word distribution back to each data end; each data end performs local topic model construction again on the basis of the global topic word distribution to obtain a new local topic word distribution and sends it to the coordination end; the coordination end merges the new local topic word distributions to obtain a new global topic word distribution and issues it again. When the coordination end detects that the number of rounds has reached the preset number, it takes the latest global topic word distribution as the global topic model.
In this embodiment, the data end performs local topic model construction on the corpus of its local end to obtain the local topic word distribution corresponding to the corpus and sends it to the coordination end; the coordination end merges the local topic word distributions sent by the data ends to obtain the global topic word distribution and then the global topic model. Because the corpora of all data ends jointly contribute to the construction, more text information is available for building the model. Moreover, each data end does not send its corpus to the coordination end directly but sends only the topic word distribution, so the text information in the corpus is never directly exposed; thus the quality of the topic model is improved while the private information in each data end's corpus is protected.
Further, in an embodiment, the step of merging the local topic word distributions in step A20 to obtain a global topic word distribution includes:
Step A201, merging the local topic word distributions to obtain a total topic word distribution;
After the coordination end obtains the local topic word distribution sent by each data end, it can directly splice and merge them to obtain the total topic word distribution.
Step A202, calculating the similarity between every two topics according to the word distribution corresponding to each topic in the total topic word distribution, and obtaining the topic pairs whose similarity is greater than a preset threshold based on the calculation result;
Because each data end builds its topic model from its own corpus, two data ends may produce highly similar topics in their local topic word distributions; to improve the accuracy of topic division, the coordination end can merge the highly similar topics in the total topic word distribution. Specifically, the coordination end may calculate the pairwise similarity of topics according to the word distribution corresponding to each topic in the total topic word distribution, i.e., traverse the topics in the total topic word distribution and calculate the similarity of two topics at a time, obtaining the similarity of each topic to every other topic. One way for the coordination end to calculate the similarity from the word distributions of two topics is: count the total number of distinct words contained in the two topics (a word shared by both topics is counted once), count the number of words the two topics share, and divide the number of shared words by the total number to obtain a percentage, which is taken as the similarity of the two topics. After the similarity of two topics is calculated, the coordination end checks whether it is greater than a preset threshold, and if so takes the two topics as a topic pair to be merged. For example, when the similarity of two topics is greater than 60%, the two topics are taken as a topic pair to be merged.
Step A203, merging the word distributions corresponding to the two topics in each topic pair to update the total topic word distribution, and taking the updated total topic word distribution as the global topic word distribution.
The coordination end may finally determine several topic pairs; for each topic pair, it merges the word distributions corresponding to the two topics, which completes the merging of that pair. After all topic pairs have been merged, the remaining unmerged topics in the total topic word distribution together with the newly merged topics constitute the updated total topic word distribution, and the coordination end can take the updated total topic word distribution as the global topic word distribution. The process of merging the word distributions corresponding to the two topics may specifically be: average the probabilities of the words the two topics share, so that each shared word is kept as a single word, and keep the probabilities of the remaining, non-shared words unchanged.
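The pair-merge rule sketched below averages the probabilities of shared words and keeps the rest unchanged, as described; the final renormalization is an added assumption so that the merged topic is again a probability distribution, a step the text does not spell out.

```python
def merge_topic_pair(dist_a, dist_b):
    """dist_a, dist_b: dict word -> probability for the two topics of a pair."""
    merged = {}
    for w in dist_a.keys() | dist_b.keys():
        if w in dist_a and w in dist_b:
            merged[w] = (dist_a[w] + dist_b[w]) / 2   # shared word: average
        else:
            merged[w] = dist_a.get(w, dist_b.get(w))  # unique word: unchanged
    total = sum(merged.values())
    return {w: p / total for w, p in merged.items()}  # assumed renormalization
```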
In this embodiment, the coordination end merges the highly similar topics in the total topic word distribution to obtain the global topic word distribution, so that the topics in the global topic word distribution are divided more accurately, further improving the accuracy of the global topic model.
Further, the step of obtaining a global topic model according to the global topic word distribution in step A20 includes:
step A204, detecting whether the cycle number of the global topic model construction currently carried out with each data terminal reaches a preset number;
after the coordination terminal obtains the global topic word distribution, whether the cycle number of the global topic model construction currently carried out with each data terminal is larger than the preset number can be detected. Specifically, before the coordination terminal and the data terminal start to perform global topic model construction, the cycle number may be initialized to 0, after obtaining global topic word distribution each time, the cycle number is added by 1, and then it is detected whether the cycle number reaches a preset number, that is, whether the cycle number is equal to the preset number, where the preset number is set in advance according to needs, for example, set to 100.
Step A205, if it is detected that the number of rounds has not reached the preset number, sending the global topic word distribution to each data end, so that each data end performs local topic model construction on its own corpus based on the global topic word distribution to obtain the local topic word distribution corresponding to the corpus, for the next round of global topic model construction;
If the coordination end detects that the number of rounds has not reached the preset number, i.e., the counter is smaller than the preset number, it sends the global topic word distribution to each data end. Each data end performs local topic model construction on its own corpus on the basis of the global topic word distribution to obtain the local topic word distribution corresponding to the corpus; the construction may still adopt an existing topic model construction method, not detailed here. After obtaining the new local topic word distribution, the data end sends it to the coordination end. After receiving the new local topic word distributions, the coordination end merges them into a new global topic word distribution, increments the round counter by 1, checks whether the counter has reached the preset number, and so the cycle continues.
Step A206, if it is detected that the number of rounds has reached the preset number, taking the global topic word distribution as the global topic model.
If the coordination end detects that the number of rounds has reached the preset number, i.e., the counter equals the preset number, it takes the current global topic word distribution as the global topic model. That is, global topic model construction is performed cyclically until the preset number of rounds is reached.
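Putting steps A204 to A206 together, the coordination end's cycle can be sketched as below. The data ends are represented by placeholder callables; in the actual system these would be remote parties, and the optional similar-topic merge of steps A201 to A203 could be applied after each splice.

```python
import numpy as np

def coordinate(data_end_fits, preset_rounds=100):
    """data_end_fits: list of callables; each takes the current global
    topic-word distribution (None in the first round) and returns a local one."""
    global_dist = None
    for _ in range(preset_rounds):                    # round counter vs. preset number
        local_dists = [fit(global_dist) for fit in data_end_fits]
        global_dist = np.vstack(local_dists)          # direct splice-merge
    return global_dist   # the latest global distribution is the global topic model
```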
In this embodiment, the coordination end and the data ends jointly perform global topic model construction over multiple rounds, so that the global topic model can finally converge on the corpora of all data ends, yielding a more accurate global topic model and further improving its quality.
In addition, an embodiment of the present invention further provides a topic model building apparatus, where the apparatus is deployed at a data end, and the apparatus includes:
the building module is used for building a local topic model for a local corpus to obtain local topic word distribution corresponding to the local corpus;
and the sending module is used for sending the local topic word distribution to the coordination end, so that the coordination end merges the local topic word distributions received from the data ends to obtain a global topic word distribution and obtains a global topic model according to the global topic word distribution.
Further, the building module comprises:
the processing unit is used for carrying out differential privacy processing on the corpus of the local end to obtain a processed corpus;
and the construction unit is used for constructing the local topic model of the processed corpus so as to obtain the local topic word distribution corresponding to the processed corpus.
Further, the topic model comprises topic word distribution, and the building module comprises:
the sampling unit is used for sampling according to a preset sampling algorithm to obtain the topic of each word of each document in the local corpus;
and the statistics unit is used for obtaining the local topic word distribution corresponding to the corpus based on statistics of the topic of each word in each document.
Further, the topic model construction device further includes:
the first receiving module is used for receiving the global topic model sent by the coordinating terminal;
the sampling module is used for obtaining the topic of each word in the document to be processed by sampling according to a preset sampling algorithm based on the global topic model;
and the processing module is used for obtaining the topic distribution of the document to be processed based on the topic of each word in the document to be processed.
In addition, an embodiment of the present invention further provides a topic model building apparatus, where the apparatus is deployed at a coordination end, and the apparatus includes:
the second receiving module is used for receiving the local topic word distribution sent by each data end, wherein each data end performs local topic model construction on its own corpus to obtain the local topic word distribution corresponding to the corpus;
and the merging module is used for merging the local topic word distributions to obtain a global topic word distribution and obtaining a global topic model according to the global topic word distribution.
Further, the merging module includes:
the first merging unit is used for merging the local topic word distributions to obtain a total topic word distribution;
the calculating unit is used for calculating the similarity between every two topics according to the word distribution corresponding to each topic in the total topic word distribution and obtaining the topic pairs whose similarity is greater than a preset threshold based on the calculation result;
and the second merging unit is used for merging the word distributions corresponding to the two topics in each topic pair to update the total topic word distribution, and taking the updated total topic word distribution as the global topic word distribution.
Further, the merging module includes:
the detection unit is used for detecting whether the number of rounds of global topic model construction currently performed with the data ends has reached a preset number;
the sending unit is used for sending the global topic word distribution to each data end if it is detected that the number of rounds has not reached the preset number, so that each data end performs local topic model construction on its own corpus based on the global topic word distribution to obtain the local topic word distribution corresponding to the corpus, for the next round of global topic model construction;
and the determining unit is used for taking the global topic word distribution as the global topic model if it is detected that the number of rounds has reached the preset number.
The specific implementation of the topic model construction apparatus of the present invention is basically the same as the embodiments of the topic model construction method described above, and is not repeated here.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, in which a topic model construction program is stored; when executed by a processor, the topic model construction program implements the steps of the topic model construction method described above.
For the embodiments of the topic model construction device and the computer-readable storage medium of the present invention, reference may be made to the embodiments of the topic model construction method of the present invention, which are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for constructing a topic model is applied to a data side and comprises the following steps:
performing local topic model construction on a local corpus to obtain local topic word distribution corresponding to the local corpus;
and sending the local topic word distribution to a coordination end, so that the coordination end merges the local topic word distributions received from the data ends to obtain a global topic word distribution, and obtaining a global topic model according to the global topic word distribution.
2. The topic model construction method of claim 1, wherein the step of performing local topic model construction on the local corpus to obtain local topic word distribution corresponding to the local corpus comprises:
carrying out differential privacy processing on the corpus of the local end to obtain a processed corpus;
and performing local topic model construction on the processed corpus to obtain local topic word distribution corresponding to the processed corpus.
3. The topic model construction method of claim 1, wherein the topic model comprises topic word distribution, and the step of performing local topic model construction on the local corpus to obtain local topic word distribution corresponding to the local corpus comprises:
sampling according to a preset sampling algorithm to obtain the topic of each word of each document in the local corpus;
and obtaining the local topic word distribution corresponding to the corpus based on statistics of the topic of each word in each document.
4. The topic model construction method of any one of claims 1 to 3, wherein after the step of sending the local topic word distribution to a coordination end for the coordination end to merge the local topic word distributions received from the data ends to obtain a global topic word distribution, and obtaining a global topic model according to the global topic word distribution, the method further comprises:
receiving the global topic model sent by the coordination terminal;
sampling according to a preset sampling algorithm based on the global topic model to obtain the topic of each word in the document to be processed;
and obtaining the topic distribution of the document to be processed based on the topic of each word in the document to be processed.
5. A topic model construction method is applied to a coordination terminal, and comprises the following steps:
receiving the local topic word distribution sent by each data end, wherein each data end performs local topic model construction on its own corpus to obtain the local topic word distribution corresponding to the corpus;
and merging the local topic word distributions to obtain a global topic word distribution, and obtaining a global topic model according to the global topic word distribution.
6. The topic model construction method of claim 5, wherein the step of merging each of the local topic word distributions to obtain a global topic word distribution comprises:
merging the local topic word distributions to obtain a total topic word distribution;
calculating the similarity between every two topics according to the word distribution corresponding to each topic in the total topic word distribution, and obtaining the topic pairs whose similarity is greater than a preset threshold based on the calculation result;
and merging the word distributions corresponding to the two topics in each topic pair to update the total topic word distribution, and taking the updated total topic word distribution as the global topic word distribution.
7. The topic model construction method of claim 5 or 6, wherein the step of obtaining a global topic model according to the global topic word distribution comprises:
detecting whether the number of rounds of global topic model construction currently performed with the data ends has reached a preset number;
if it is detected that the number of rounds has not reached the preset number, sending the global topic word distribution to each data end, so that each data end performs local topic model construction on its own corpus based on the global topic word distribution to obtain the local topic word distribution corresponding to the corpus, for the next round of global topic model construction;
and if the number of rounds has reached the preset number, taking the global topic word distribution as the global topic model.
8. A topic model construction device characterized by comprising: a memory, a processor and a topic model builder stored on the memory and executable on the processor, the topic model builder when executed by the processor implementing the steps of the topic model building method of any one of claims 1 to 4.
9. A topic model construction device characterized by comprising: a memory, a processor and a topic model builder stored on the memory and executable on the processor, the topic model builder when executed by the processor implementing the steps of the topic model building method of any one of claims 5 to 7.
10. A computer-readable storage medium, characterized in that a topic model construction program is stored thereon, which when executed by a processor implements the steps of the topic model construction method according to any one of claims 1 to 7.
CN202010752688.4A 2020-07-30 2020-07-30 Topic model construction method and device and computer readable storage medium Pending CN111859917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010752688.4A CN111859917A (en) 2020-07-30 2020-07-30 Topic model construction method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010752688.4A CN111859917A (en) 2020-07-30 2020-07-30 Topic model construction method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111859917A true CN111859917A (en) 2020-10-30

Family

ID=72945131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010752688.4A Pending CN111859917A (en) 2020-07-30 2020-07-30 Topic model construction method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111859917A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116796860A (en) * 2023-08-24 2023-09-22 腾讯科技(深圳)有限公司 Federal learning method, federal learning device, electronic equipment and storage medium
CN116796860B (en) * 2023-08-24 2023-12-12 腾讯科技(深圳)有限公司 Federal learning method, federal learning device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US20190207985A1 (en) Authorization policy recommendation method and apparatus, server, and storage medium
EP3905126A2 (en) Image clustering method and apparatus
CN111061874B (en) Sensitive information detection method and device
CN111243601B (en) Voiceprint clustering method and device, electronic equipment and computer-readable storage medium
CN113222942A (en) Training method of multi-label classification model and method for predicting labels
CN109948160B (en) Short text classification method and device
CN113255370A (en) Industry type recommendation method, device, equipment and medium based on semantic similarity
CN110210038B (en) Core entity determining method, system, server and computer readable medium thereof
CN111241813B (en) Corpus expansion method, apparatus, device and medium
CN110162637B (en) Information map construction method, device and equipment
CN110738056B (en) Method and device for generating information
CN112528641A (en) Method and device for establishing information extraction model, electronic equipment and readable storage medium
CN113657249B (en) Training method, prediction method, device, electronic equipment and storage medium
CN110096605B (en) Image processing method and device, electronic device and storage medium
JP6563350B2 (en) Data classification apparatus, data classification method, and program
CN109885831B (en) Keyword extraction method, device, equipment and computer readable storage medium
CN110619253B (en) Identity recognition method and device
CN111859917A (en) Topic model construction method and device and computer readable storage medium
CN113947701A (en) Training method, object recognition method, device, electronic device and storage medium
CN113934848A (en) Data classification method and device and electronic equipment
CN112765357A (en) Text classification method and device and electronic equipment
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN114676677A (en) Information processing method, information processing apparatus, server, and storage medium
CN114297380A (en) Data processing method, device, equipment and storage medium
CN113408632A (en) Method and device for improving image classification accuracy, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination