CN114528588A - Cross-modal privacy semantic representation method, device, equipment and storage medium - Google Patents

Cross-modal privacy semantic representation method, device, equipment and storage medium

Info

Publication number
CN114528588A
CN114528588A (application CN202210089691.1A)
Authority
CN
China
Prior art keywords
data
modal
dense
keywords
privacy
Prior art date
Legal status
Pending
Application number
CN202210089691.1A
Other languages
Chinese (zh)
Inventor
程正涛
张伟哲
束建钢
杨帆
邹庆胜
Current Assignee
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202210089691.1A
Publication of CN114528588A
Current legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 - Protecting data
    • G06F21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 - Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227 - Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 - Indexing; Data structures therefor; Storage structures
    • G06F16/313 - Selection or weighting of terms for indexing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/901 - Indexing; Data structures therefor; Storage structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 - Protecting data
    • G06F21/602 - Providing cryptographic facilities or services
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 - Protecting data
    • G06F21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 - Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 - Protecting personal data, e.g. for financial or medical purposes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-modal privacy semantic representation method, device, equipment and storage medium in the technical field of data processing. The method comprises the following steps: obtaining multi-modal data; obtaining corresponding text data from the multi-modal data; extracting keywords from the text data and encrypting them to obtain dense-state keywords; segmenting a preset knowledge graph according to the dense-state keywords to obtain dense-state subgraphs; and performing graph embedding on the dense-state subgraphs to obtain dense-state representation vectors corresponding to the dense-state keywords, so as to obtain a semantic representation result of the multi-modal data. The method solves the problem of poor semantic relevance among dense-state keywords in the prior art: it preserves the semantic association among the dense-state keywords and provides an accurate semantic representation for subsequent retrieval of private semantics.

Description

Cross-modal privacy semantic representation method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of data processing, and in particular to a cross-modal privacy semantic representation method, device, equipment and storage medium.
Background
With the development of internet technology and the popularization of cloud services, the contradiction between big-data sharing and privacy protection has become increasingly acute. Against this background, retrieval of cross-modal data has become an essential requirement of the cloud-service and big-data era, and semantic representation of cross-modal data is a key component of a cross-modal data retrieval system.
Cross-modal semantic representation technology encodes data of different modalities through a model to obtain keywords, so that keywords derived from different-modality data with the same semantic content are highly correlated and the correlation can be computed explicitly. Cross-modal privacy semantic representation technology adds privacy-protection requirements on top of cross-modal semantic representation: the retrieval system must be able to encode cross-modal data into dense-state keywords without uploading plaintext data to the cloud server, and then retrieve private semantics according to those dense-state keywords. However, current cross-modal privacy semantic representation technology suffers from poor semantic relevance among the dense-state keywords.
Disclosure of Invention
The main purpose of the invention is to provide a cross-modal privacy semantic representation method, device, equipment and storage medium that solve the technical problem of poor semantic relevance among dense-state keywords in the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a cross-modal privacy semantic representation method, including:
obtaining multi-modal data;
obtaining corresponding text data according to the multi-modal data;
extracting keywords from the text data and encrypting them to obtain dense-state keywords;
segmenting a preset knowledge graph according to the dense-state keywords to obtain a dense-state subgraph;
and performing graph embedding on the dense-state subgraph to obtain a dense-state representation vector corresponding to each dense-state keyword, so as to obtain a semantic representation result of the multi-modal data.
Optionally, in the above cross-modality privacy semantic representation method, the multi-modality data includes data information of at least two different modalities;
the step of obtaining corresponding text data from the multimodal data comprises:
when the multi-modal data comprises first modal data of a voice modality, converting the first modal data into first text data by utilizing a voice recognition technology;
when the multi-modal data comprise second modal data of a video modality, converting the second modal data into second text data by utilizing a trained text generation model;
when the multi-modal data includes third modal data of a text modality, directly determining the third modal data as third text data.
Optionally, in the above cross-modal privacy semantic representation method, the step of extracting keywords from the text data and encrypting them to obtain the dense-state keywords includes:
extracting keywords from the first text data, the second text data and/or the third text data and encrypting them to obtain dense-state keywords.
Optionally, in the above cross-modal privacy semantic representation method, the step of extracting keywords from the text data and encrypting them to obtain the dense-state keywords includes:
extracting keywords from the text data through an unsupervised learning algorithm;
and encrypting the keywords through a symmetric encryption algorithm to obtain dense-state keywords.
Optionally, in the above cross-modal privacy semantic representation method, the step of extracting the keywords from the text data by using an unsupervised learning algorithm to obtain the keywords includes:
performing word segmentation processing on the text data to obtain a plurality of words;
drawing a vocabulary network diagram according to the vocabularies, wherein the network nodes of the vocabulary network diagram correspond to the vocabularies, an edge connecting two network nodes has an attribute value, and the attribute value is determined according to the co-occurrence relation of the vocabularies;
and ranking and screening the vocabularies according to the vocabulary network diagram to obtain keywords representing the text data.
Optionally, in the above cross-modality privacy semantic representation method, before the step of segmenting the preset knowledge graph according to the dense-state keyword to obtain a dense-state sub-graph, the method further includes:
determining a basic knowledge graph from an open-source knowledge graph;
and encrypting the basic knowledge graph to obtain the preset knowledge graph, wherein the encryption algorithm adopted for this encryption is consistent with the encryption algorithm adopted for encrypting the text data.
Optionally, in the above cross-modal privacy semantic representation method, the step of segmenting the preset knowledge graph according to the dense-state keyword to obtain a dense-state sub-graph includes:
matching, according to the dense-state keywords, the entities corresponding to the dense-state keywords in the preset knowledge graph to obtain knowledge nodes;
and segmenting the preset knowledge graph with each knowledge node as a center according to a preset cutting distance to obtain the dense-state subgraph, wherein the length unit of the preset cutting distance is one edge between two entities, and the dense-state subgraph is the set of entities and edges within the preset cutting distance of the knowledge node at its center.
In a second aspect, the present invention provides a cross-modal privacy semantic representation apparatus, including:
the data acquisition module is used for acquiring multi-modal data;
the text description module is used for obtaining corresponding text data according to the multi-modal data;
the keyword extraction module is used for extracting keywords from the text data and encrypting them to obtain dense-state keywords;
the graph segmentation module is used for segmenting the preset knowledge graph according to the dense-state keywords to obtain dense-state subgraphs;
and the graph embedding module is used for performing graph embedding on the dense-state subgraphs to obtain the dense-state representation vectors corresponding to the dense-state keywords, so as to obtain a semantic representation result of the multi-modal data.
In a third aspect, the present invention provides a cross-modal privacy semantic representation device, where the device includes a processor and a memory, where the memory stores a cross-modal privacy semantic representation program, and when the cross-modal privacy semantic representation program is executed by the processor, the cross-modal privacy semantic representation device implements the cross-modal privacy semantic representation method.
In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program executable by one or more processors to implement the cross-modal privacy semantic representation method described above.
One or more technical solutions provided by the present invention may have the following advantages or at least achieve the following technical effects:
according to the cross-modal privacy semantic representation method, device, equipment and storage medium, after corresponding text data are obtained according to the obtained multi-modal data, keyword extraction and encryption are carried out on the text data to obtain the secret keywords, so that data privacy can be protected, and data safety is guaranteed; the preset knowledge graph is segmented according to the dense-state keywords to obtain dense-state subgraphs, semantic information of the dense-state keywords is effectively expanded in the form of sub-knowledge graphs, semantic concepts of the dense-state keywords can be expressed more comprehensively, and strong correlation with the dense-state keywords is kept; and the dense-state sub-graph is embedded to obtain dense-state representation vectors corresponding to the dense-state keywords, so that the semantic representation result of the multi-modal data is obtained, richer semantic information is encoded, and the semantic information relevance among the dense-state keywords is ensured. According to the invention, on the premise of ensuring the privacy of user data, the semantic representation of the cross-modal data is realized, so that not only can the semantic association between the dense-state keywords be ensured, but also the accurate semantic representation can be provided for the follow-up retrieval of privacy semantics, and meanwhile, the supported modal data can be dynamically increased or decreased according to the business requirements, and the flexibility is higher.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a flowchart illustrating a cross-modal privacy semantic representation method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a hardware structure of a cross-modal privacy semantic representation apparatus according to the present invention;
FIG. 3 is a flowchart illustrating a cross-modal privacy semantic representation method according to a second embodiment of the present invention;
FIG. 4 is a diagram illustrating a basic knowledge graph according to a second embodiment of the cross-modal privacy semantic representation method of the present invention;
FIG. 5 is a schematic diagram of a secret subgraph in a second embodiment of the cross-modal privacy semantic representation method according to the present invention;
FIG. 6 is a functional block diagram of a cross-modal privacy semantic representation apparatus according to a first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, in the present invention, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or system. Without further limitation, an element preceded by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or system that comprises the element. In addition, the meaning of "and/or" throughout the text covers three parallel cases; for example, "A and/or B" includes A alone, B alone, or both A and B.
In the present invention, if descriptions referring to "first", "second", etc. appear, they are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the present invention, suffixes such as "module", "part", or "unit" used to denote elements are used only to facilitate the description of the present invention and have no specific meaning in themselves; thus "module", "part" and "unit" may be used interchangeably.
The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific situation. In addition, the technical solutions of the respective embodiments may be combined with each other, provided that such combinations can be realized by those skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination should be considered absent and outside the protection scope of the present invention.
Analysis of the prior art shows that, with the development of internet technology and the popularization of cloud services, the contradiction between big-data sharing and privacy protection has become increasingly acute. Retrieval is a common function in big-data scenarios and is generally realized by a retrieval system; in the current context, guaranteeing data privacy and security while improving data retrieval capability is important for a retrieval system. Meanwhile, as technology advances and people's expectations grow, the concepts of privacy and data keep broadening. Privacy is no longer limited to traditionally sensitive information in the narrow sense, such as identity and financial information; data about daily life and travel is increasingly regarded as private. Data has likewise evolved from traditional forms such as text and tables to the coexistence of multiple modalities such as video, audio and geographic data.
On this basis, retrieval of cross-modal data has become an essential requirement of the cloud-service and big-data era, and semantic representation of cross-modal data is a key component of a cross-modal data retrieval system.
Cross-modal semantic representation technology jointly models data of different modalities to obtain a model that can encode data of each modality into keywords, so that keywords derived from different-modality data with the same semantics are highly correlated and the correlation can be computed explicitly. For example, if the text "many people are in a meeting in a conference room" and a picture of a conference room are each input into the model for encoding, the semantic relevance between the resulting text keyword and image keyword is higher than the relevance between the text keyword and a keyword obtained by encoding data with other semantic content.
Cross-modal privacy semantic representation technology adds privacy-protection requirements on top of cross-modal semantic representation technology: the retrieval system must be able to encode cross-modal data into dense-state keywords without uploading plaintext data to the cloud server, and then retrieve private semantics according to those dense-state keywords.
The current cross-modal privacy semantic representation technology has some problems, such as:
1. the extraction of keywords of cross-modal data is difficult;
2. few modalities are supported, and they are difficult to extend;
3. the semantic relevance among the dense-state keywords is poor.
In view of the technical problem in the prior art that semantic relevance among dense-state keywords is poor, the invention provides a cross-modal privacy semantic representation method, which has the following general idea:
obtaining multi-modal data; obtaining corresponding text data according to the multi-modal data; extracting keywords from the text data and encrypting them to obtain dense-state keywords; segmenting a preset knowledge graph according to the dense-state keywords to obtain a dense-state subgraph; and performing graph embedding on the dense-state subgraph to obtain a dense-state representation vector corresponding to each dense-state keyword, so as to obtain a semantic representation result of the multi-modal data.
According to the above technical solution, corresponding text data are obtained from the acquired multi-modal data, and keywords are then extracted from the text data and encrypted to obtain dense-state keywords, which protects data privacy and guarantees data security. The preset knowledge graph is segmented according to the dense-state keywords to obtain dense-state subgraphs, so that the semantic information of each dense-state keyword is effectively expanded in the form of a sub-knowledge graph; the semantic concept of the dense-state keyword can thus be expressed more comprehensively while remaining strongly correlated with it. Graph embedding is then performed on the dense-state subgraphs to obtain the dense-state representation vectors corresponding to the dense-state keywords and hence the semantic representation result of the multi-modal data, which encodes richer semantic information and preserves the semantic relevance among the dense-state keywords. On the premise of guaranteeing the privacy of user data, the invention realizes semantic representation of cross-modal data: it ensures the semantic association among the dense-state keywords, provides an accurate semantic representation for subsequent retrieval of private semantics, and allows the supported modalities to be dynamically added or removed according to business requirements, giving greater flexibility.
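As a non-limiting illustration of this general idea, the overall data flow can be sketched in Python as follows; the five callables are hypothetical stand-ins for the components detailed in the embodiments below, not the claimed implementation itself.

```python
# Illustrative sketch of the overall flow; the callables are hypothetical
# stand-ins supplied by the caller and correspond to the steps listed above.
def represent_multimodal_data(multimodal_data, to_text, extract_keywords,
                              encrypt, cut_subgraph, embed, cut_distance=2):
    # Steps 1-2: obtain multi-modal data (passed in) and convert each item to text.
    texts = [to_text(item) for item in multimodal_data]
    # Step 3: extract keywords from the pooled text and encrypt them into dense-state keywords.
    dense_keywords = [encrypt(kw) for kw in extract_keywords(" ".join(texts))]
    # Step 4: cut one dense-state subgraph per dense-state keyword from the preset knowledge graph.
    dense_subgraphs = [cut_subgraph(dk, cut_distance) for dk in dense_keywords]
    # Step 5: graph-embed each subgraph into a dense-state representation vector;
    # the resulting vector set is the semantic representation of the multi-modal data.
    return [embed(g) for g in dense_subgraphs]
```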
The cross-modal privacy semantic representation method, apparatus, device and storage medium provided by the present invention are described in detail by specific embodiments and implementations with reference to the accompanying drawings.
Example one
Referring to the flowchart of fig. 1, a first embodiment of the cross-modal privacy semantic representation method according to the present invention is provided; the method is applied to a cross-modal privacy semantic representation device. The cross-modal privacy semantic representation device refers to a terminal device or network device capable of network connection, and may be a terminal device such as a mobile phone, computer or tablet computer, or a network device such as a server or cloud platform. The method can also be applied to a semantic retrieval system comprising a terminal device and a network device: in that case, part of the steps of the method are performed on the terminal device, the obtained result is sent to the network device, and the remaining steps are then performed on the network device.
Fig. 2 is a schematic diagram of the hardware structure of a cross-modal privacy semantic representation device. The device may include: a processor 1001, such as a CPU (Central Processing Unit), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Those skilled in the art will appreciate that the hardware architecture shown in fig. 2 does not limit the cross-modal privacy semantic representation device of the present invention, which may include more or fewer components than those shown, combine some components, or arrange the components differently.
Specifically, the communication bus 1002 is used for realizing connection communication among these components;
the user interface 1003 is used for connecting a client and performing data communication with it; the user interface 1003 may include an output unit such as a display screen and an input unit such as a keyboard, and optionally may also include other input/output interfaces, such as standard wired interfaces and wireless interfaces;
the network interface 1004 is used for connecting to a backend server and performing data communication with it; the network interface 1004 may include an input/output interface, such as a standard wired interface or a wireless interface (e.g., a Wi-Fi interface);
the memory 1005 is used for storing various types of data, which may include, for example, instructions of any application program or method in the cross-modal privacy semantic representation device and application-related data; the memory 1005 may be a high-speed RAM memory or a stable memory such as a disk memory, and optionally may also be a storage device independent of the processor 1001;
specifically, with continued reference to fig. 2, the memory 1005 may include an operating system, a network communication module, a user interface module, and a cross-modal privacy semantic representation program, where the network communication module is mainly used to connect to a server and perform data communication with the server;
the processor 1001 is configured to invoke the cross-modal privacy semantic representation program stored in the memory 1005 and perform the following operations:
obtaining multi-modal data;
obtaining corresponding text data according to the multi-modal data;
extracting keywords from the text data and encrypting them to obtain dense-state keywords;
segmenting a preset knowledge graph according to the dense-state keywords to obtain a dense-state subgraph;
and performing graph embedding on the dense-state subgraph to obtain a dense-state representation vector corresponding to each dense-state keyword, so as to obtain a semantic representation result of the multi-modal data.
Based on the above cross-modal privacy semantic representation device, the following describes in detail the cross-modal privacy semantic representation method according to this embodiment with reference to the flowchart shown in fig. 1. The method may comprise the steps of:
step S110: multimodal data is acquired.
Multi-modal data refers to data information having a plurality of different modalities, such as video data, audio data, image data, text data, and so on. The user can input the data manually through the user interface of the device, or the device can receive data transmitted by other devices through its network interface. The acquired data may include several modalities or only one modality; here only data including several modalities is taken as an example.
Step S120: and obtaining corresponding text data according to the multi-modal data.
For the multi-modal data obtained in step S110, when the data includes data of a non-text modality such as image, audio or video, modality conversion is required to convert it from the non-text modality into the text modality; when the data includes data of the text modality, no conversion is needed, and the text-modality data is used directly as part of the text data finally obtained in this step. When converting non-text-modality data, text recognition can be performed with models trained on the characteristics of the respective non-text modality (pictures, audio or video), which improves the accuracy of the textual description of the non-text-modality data.
In a specific implementation, when the multi-modal data contains image data, text data corresponding to the image data can be obtained with image recognition technology, for example by using a pre-trained deep learning model for the text description; when the multi-modal data contains audio data, text data corresponding to the audio data can be obtained with speech recognition technology, for example by using a deep learning model such as Deep Speech V2; when the multi-modal data contains video data, text data corresponding to the video data can be obtained with text description generation technology, for example by using a pre-trained text generation model such as Video BERT. It should be noted that, no matter how many kinds of modality data the multi-modal data includes, once the non-text-modality data have been converted into the text modality, all the text-modality data (the data that was originally text and the text obtained by conversion) are aggregated to obtain the text data corresponding to the acquired multi-modal data.
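As a non-limiting sketch of the dispatch logic just described, the conversion step might be organized as below; the model callables are assumptions injected by the caller, standing in for a Deep Speech V2-style recognizer, a pre-trained image-captioning model and a Video BERT-style description model.

```python
# Hypothetical dispatcher: every non-text modality is routed to a text-description model.
def multimodal_to_text(items, speech_to_text, image_to_text, video_to_text):
    """`items` is assumed to be a list of (modality, payload) pairs."""
    texts = []
    for modality, payload in items:
        if modality == "text":
            texts.append(payload)                  # text modality needs no conversion
        elif modality == "audio":
            texts.append(speech_to_text(payload))  # e.g. a Deep Speech V2-style model
        elif modality == "image":
            texts.append(image_to_text(payload))   # e.g. a pre-trained captioning model
        elif modality == "video":
            texts.append(video_to_text(payload))   # e.g. a Video BERT-style model
        else:
            raise ValueError(f"unsupported modality: {modality}")
    return texts                                   # aggregated text data for the next step
```

Supporting a new modality then only requires registering one more conversion callable, which matches the extensibility noted in the second embodiment.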
Step S130: extracting keywords from the text data and encrypting them to obtain dense-state keywords.
Keyword extraction is performed on the text data because the text data may contain many words or unnecessary words that easily affect the accuracy of the semantic representation; and to ensure data privacy, the extracted keywords are encrypted, yielding the dense-state keywords of the multi-modal data. The keywords may be extracted with the TextRank algorithm, and the extracted keywords may be encrypted with a symmetric encryption algorithm, for example the DES (Data Encryption Standard) algorithm, an RC (Rivest Cipher) algorithm, or the Blowfish algorithm. It should be noted that when the method of this embodiment runs on a terminal device or a network device alone, the dense-state keywords can be processed further on that device to obtain the semantic representation result of the cross-modal data; when the method runs on a semantic retrieval system comprising a terminal device and a network device, the terminal device performs steps S110 to S130 and sends the resulting dense-state keywords to the network device, which then processes them to obtain the semantic representation result. Both sending and receiving are performed in encrypted form, so data privacy is still guaranteed.
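As a non-limiting sketch of the encryption half of this step, assuming the pycryptodome library is available: DES is used in ECB mode here purely so that identical keywords always map to identical ciphertexts and can later be matched against the encrypted knowledge graph; the text above only requires some symmetric algorithm.

```python
from Crypto.Cipher import DES            # assumption: pycryptodome supplies the cipher
from Crypto.Util.Padding import pad

def encrypt_keyword(keyword: str, key: bytes) -> bytes:
    """Deterministically encrypt one extracted keyword into a dense-state keyword.

    ECB mode is chosen only because the same plaintext must always yield the same
    ciphertext, so dense-state keywords stay matchable against the encrypted graph.
    """
    cipher = DES.new(key, DES.MODE_ECB)  # DES requires an 8-byte key
    return cipher.encrypt(pad(keyword.encode("utf-8"), DES.block_size))

# Usage sketch (the key below is a placeholder, not from the patent):
# dense_keyword = encrypt_keyword("meeting", b"8bytekey")
```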
Step S140: and segmenting the preset knowledge graph according to the dense-state keywords to obtain a dense-state subgraph.
The preset knowledge graph is a graph structure that describes vocabulary entities as nodes and the relationships among entities as edges, and can explicitly describe the associations among vocabularies. The preset knowledge graph is an encrypted open-source knowledge graph such as Wikidata, where the encryption mode and key are consistent with those used to encrypt the keywords in step S130; this ensures that the dense-state keywords can be matched and processed while data privacy is preserved. Segmenting the preset knowledge graph according to the dense-state keywords means matching each dense-state keyword to a node in the preset knowledge graph and, according to the set cutting distance, obtaining a sub-knowledge graph centered on that dense-state keyword. The sub-knowledge graph comprises the entity corresponding to the dense-state keyword, the other entities within the set cutting distance around it, and the set of edges representing the association relations between these entities. Because the preset knowledge graph is encrypted, the sub-knowledge graph obtained by cutting is also encrypted, that is, a dense-state subgraph is obtained. The number of dense-state subgraphs equals the number of dense-state keywords, i.e., the dense-state subgraphs correspond one-to-one to the dense-state keywords.
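As a non-limiting sketch of the segmentation, assuming the encrypted preset knowledge graph is held as a networkx graph whose node identifiers are the encrypted entity labels, networkx's ego_graph yields the entities within the cutting distance together with the edges among them:

```python
import networkx as nx

def cut_dense_subgraph(preset_kg: nx.Graph, dense_keyword, cut_distance: int = 2) -> nx.Graph:
    """Cut the dense-state subgraph centred on the entity matching a dense-state keyword."""
    if dense_keyword not in preset_kg:
        raise KeyError("no entity in the preset knowledge graph matches this dense-state keyword")
    # Keep every entity reachable within `cut_distance` edges of the matched knowledge
    # node, together with the edges among those entities.
    return nx.ego_graph(preset_kg, dense_keyword, radius=cut_distance)
```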
Using the sub-knowledge graph centered on a dense-state keyword as the expression of that keyword, i.e., jointly representing the dense-state keyword by the entities and relations strongly correlated with it, expresses the semantic concept of the dense-state keyword more comprehensively. Because the semantic concept is characterized precisely through other related entities and their relations, easily confused dense-state keywords become more clearly distinguishable.
Step S150: performing graph embedding on the dense-state subgraph to obtain the dense-state representation vector corresponding to each dense-state keyword, so as to obtain the semantic representation result of the multi-modal data.
Each dense-state keyword corresponds to one dense-state subgraph. After the dense-state subgraphs are obtained in step S140, a graph embedding operation based on a random-walk algorithm is performed on each dense-state subgraph to obtain its dense-state representation vector, that is, the dense-state representation vector corresponding to the dense-state keyword; this vector is the semantic representation of that dense-state keyword. After the dense-state subgraphs corresponding to all the dense-state keywords have been obtained and embedded, the resulting dense-state representation vectors are aggregated into a vector set, which is the semantic representation result of the multi-modal data of this embodiment. The semantic representation result can be used directly as the input of semantic retrieval, so that a retrieval result is obtained and retrieval of cross-modal data is achieved. For example, a user can input multi-modal data on the device, obtain the semantic representation result with the method of this embodiment, and then feed the result into a retrieval system to obtain the final cross-modal retrieval result.
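As a non-limiting illustration of how the resulting vector set might be consumed, the sketch below stacks the per-keyword dense-state representation vectors and compares two vectors with cosine similarity; the similarity measure is an assumption added purely for illustration, since the retrieval side is not specified here.

```python
import numpy as np

def semantic_representation(dense_vectors) -> np.ndarray:
    """Stack the per-keyword dense-state representation vectors into the result set."""
    return np.stack(dense_vectors)                 # shape: (num_keywords, dim)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Illustrative similarity between two dense-state representation vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```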
According to the cross-modal privacy semantic representation method provided by this embodiment, corresponding text data are obtained from the acquired multi-modal data, and keywords are then extracted from the text data and encrypted to obtain dense-state keywords, which protects data privacy and guarantees data security. The preset knowledge graph is segmented according to the dense-state keywords to obtain dense-state subgraphs, so that the semantic information of each dense-state keyword is effectively expanded in the form of a sub-knowledge graph and its semantic concept is expressed more comprehensively while remaining strongly correlated with it. Graph embedding is performed on the dense-state subgraphs to obtain the dense-state representation vectors corresponding to the dense-state keywords and hence the semantic representation result of the multi-modal data, which encodes richer semantic information and preserves the semantic relevance among the dense-state keywords. On the premise of guaranteeing the privacy of user data, the method realizes semantic representation of cross-modal data: it ensures the semantic association among the dense-state keywords, provides an accurate semantic representation for subsequent retrieval of private semantics, and allows the supported modalities to be dynamically added or removed according to business requirements, giving greater flexibility.
Example two
Based on the same inventive concept, and referring to fig. 3 to 5, a second embodiment of the cross-modal privacy semantic representation method is provided. The method is applied to cross-modal privacy semantic representation equipment and can also be applied to a cross-modal privacy semantic retrieval system running on such equipment; the system performs privacy semantic representation on multi-modal data with this method and then performs privacy semantic retrieval according to the semantic representation result to obtain a semantic retrieval result. Semantic retrieval is the process of automatically querying and extracting relevant information from an information source at the semantic level, according to the user's needs, on the basis of correctly parsing syntax and understanding the relation between semantics and words; privacy semantic retrieval performs this retrieval over private data and additionally requires the security of that private data.
The cross-modal privacy semantic representation method of the present embodiment is described in detail below with reference to the flowchart shown in fig. 3. The method may comprise the steps of:
step S210: multimodal data is acquired.
In particular, the multi-modal data comprises data information of at least two different modalities. Neither the types nor the number of modalities in the multi-modal data are limited, and both can be increased or decreased according to business requirements. This embodiment is illustrated with multi-modal data comprising three modalities (voice, video and text), with English as the example language.
Step S220: and obtaining corresponding text data according to the multi-modal data.
In a specific implementation, text generation models can be used to obtain the corresponding text data from the multi-modal data. Specifically, the multi-modal data serve as input, the data of each modality is fed into its corresponding text generation model, and the respective text-modality outputs together form the text data describing the multi-modal data.
Specifically, step S220 may include:
step S221: when the multi-modal data comprises first modal data of a speech modality, converting the first modal data into first text data by using a speech recognition technology.
Speech recognition technology converts data containing human speech into text and builds a mapping between the text and the speech. In this embodiment, a Deep Speech V2 model may be used as the speech recognition model; it can be trained with speech data of the chosen language, and here English speech data is used to obtain the speech recognition model. In the implementation, the first modality data is fed into the model, which directly outputs the corresponding text description.
Step S222: and when the multi-modal data comprises second modal data of a video modality, converting the second modal data into second text data by using a trained text generation model.
The trained text generation model is obtained by training on a data set that contains both text descriptions and video content as training data. In the implementation, the second modality data is fed into the trained text generation model, which directly outputs the corresponding text description.
In this embodiment, the text generation model is a Video BERT model, which is trained with the BERT language model and a self-supervised learning method, using a data set containing both text description data and video content data as training data. The training proceeds as follows: first, feature vectors are extracted from the video content data and discretized by clustering to build a visual vocabulary, which is then combined with the text description data to form a cross-modal vocabulary; visual tokens and linguistic tokens are then derived from the cross-modal vocabulary and input into the text generation model to be trained, and the model learns the bidirectional joint distribution over the token sequences, thereby building a mapping between visual tokens and linguistic tokens and yielding the trained text generation model.
Step S223: when the multi-modal data includes third modal data of a text modality, directly determining the third modal data as third text data.
Text-modality data requires no processing and can go directly to the next operation. The next operation may be to aggregate the converted text data with the original text-modality data to obtain the text data and then perform keyword extraction and encryption on it, or keyword extraction and encryption may be performed directly on the converted text data or on the text-modality data alone; this depends on the types and amount of the multi-modal data and is set according to the actual situation.
In this embodiment, the three modalities (voice, video and text) are each processed according to the steps above to obtain three pieces of text data, which are then aggregated into the text data. All modality data are thus unified into the text modality; neither the number nor the types of modalities are limited, and when data of a new modality is added, only a corresponding text description generation model needs to be added, so the method is extensible.
Step S230: and extracting and encrypting the keywords of the text data to obtain secret keywords.
In one embodiment, step 230 may comprise:
step 231: and extracting and encrypting the keywords of the first text data, the second text data and/or the third text data to obtain secret keywords.
Since the modality types of the multi-modal data can be increased or decreased and there may be several of them, there may be several corresponding pieces of text data. When there are several, for example the three pieces of text data of this embodiment, some or all of them can be aggregated and keyword extraction and encryption can then be performed directly on the result to obtain the dense-state keywords corresponding to the multi-modal data.
In another embodiment, step 230 may include:
step 232: and extracting keywords from the text data through an unsupervised learning algorithm to obtain keywords.
Keywords are generally words, or phrases consisting of several words, that summarize the subject or meaning of a text. In this embodiment, keyword extraction is performed with the TextRank algorithm, an unsupervised algorithm that can take a single document or text data as input. Its principle is to split the text into words that serve as network nodes, forming a word network graph; the correlations among words are treated as recommendation or voting relations, so the importance of each word can be computed, and the top N words are then screened out as the keywords representing the whole document or text.
Specifically, step 232 may include:
step 232.1: and performing word segmentation processing on the text data to obtain a plurality of words.
In a specific implementation, word segmentation, part-of-speech tagging, stop-word removal and other operations can be performed on the text data. In the word segmentation step, Chinese text can be segmented with a Chinese word segmentation tool such as jieba, and seven part-of-speech categories, such as common nouns, proper nouns, common verbs, auxiliary verbs and adjectives, are retained; a plurality of words is finally obtained.
In this embodiment, a plurality of words can be obtained by performing word segmentation processing on the text data obtained in step S220.
Step 232.2: drawing a vocabulary network diagram according to the vocabularies; the network nodes of the vocabulary network graph correspond to the vocabularies, edges connecting the two network nodes have attribute values, and the attribute values are determined according to the co-occurrence relation of the vocabularies.
In the specific implementation, each vocabulary is taken as a network node and a vocabulary network diagram of the plurality of vocabularies is drawn; in this diagram, the edges between network nodes, i.e. between vocabularies, have attribute values, and an attribute value is determined according to the co-occurrence relation of the vocabularies represented by the two network nodes.
In this embodiment, the vocabulary network diagram is drawn from the plurality of vocabularies obtained in step S232.1: the network node set of the diagram consists of these vocabularies, and an edge between two network nodes is determined by analyzing the co-occurrence relation between the vocabularies represented by any two nodes in the set. That is, an edge is drawn between two network nodes only when the corresponding vocabularies co-occur within a window of length K, where K is the window size, i.e. at most K vocabularies co-occur in one window.
Step 232.3: and sequencing and screening the vocabularies according to the vocabulary network diagram to obtain keywords representing the text data.
In the vocabulary network diagram, the weight of each network node, i.e. each vocabulary, is computed with an iterative algorithm until convergence; the vocabularies are then ranked by weight, and a preset number of them are screened out as the keywords representing the text data. In practice, the ranking method and the preset number can be set according to the actual situation.
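As a non-limiting sketch of this ranking step, the iteration until convergence can be realized with networkx's PageRank; the function below assumes its input is the list of retained vocabularies from step 232.1 and uses co-occurrence counts within a window of length K as the edge attribute values.

```python
import networkx as nx

def textrank_keywords(vocabularies, window_k: int = 4, top_n: int = 10):
    """TextRank-style keyword ranking over the vocabulary network diagram (illustrative sketch)."""
    graph = nx.Graph()
    graph.add_nodes_from(set(vocabularies))
    # Draw an edge only when two vocabularies co-occur within a window of length K;
    # the accumulated co-occurrence count serves as the edge attribute value.
    for i, word in enumerate(vocabularies):
        for other in vocabularies[i + 1:i + window_k]:
            if other != word:
                weight = graph.get_edge_data(word, other, {}).get("weight", 0) + 1
                graph.add_edge(word, other, weight=weight)
    # Iterate node weights until convergence, then rank and screen the top-N vocabularies.
    scores = nx.pagerank(graph, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```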
Step 233: and encrypting the keywords through a symmetric encryption algorithm to obtain secret keywords.
The keywords obtained in step S232.3 are encrypted with a symmetric encryption algorithm. Such algorithms have low computational cost, fast encryption speed and high efficiency, support high-speed encryption and decryption, and can use long keys that are hard to crack, so they ensure the privacy and security of the multi-modal data while also improving the processing speed of the method.
Step S240: and acquiring a preset knowledge graph.
Specifically, step S240 may include:
step S241: determining a basic knowledge graph through the open source knowledge graph;
the knowledge graph is a knowledge base, wherein knowledge is integrated through a data model or topology of a graph structure, the graph structure is used for visually describing knowledge entities in the form of nodes, and relationships among the knowledge entities are also visually described in the form of edges, so that the association among the knowledge is explicitly described.
Fig. 4 is a schematic diagram of the preset knowledge graph of the present embodiment. In the figure, there are a plurality of entities, and one entity represents one node. In this embodiment, an example language based on the setting is english, and the open source knowledge graph Wikidata is used as a basic knowledge base, and the open source knowledge graph is a large database, stores massive information in wikipedia and Freebase, has description capability of common things, and can meet the requirements of the embodiment.
Step S242: encrypting the basic knowledge graph to obtain a preset knowledge graph; and the encryption algorithm adopted by the encryption processing is consistent with the encryption algorithm adopted by the encryption of the text data.
The basic knowledge graph is encrypted to guarantee data privacy. It should be noted that the encryption mode here is consistent with the encryption mode adopted for encrypting the keywords in step S233, and the keys used are also consistent, so that the dense-state keywords can subsequently be matched against the preset knowledge graph successfully.
Encrypting both the text keywords and the knowledge graph with the same symmetric encryption method guarantees data privacy and security while preserving the matchability of the dense-state keywords.
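As a brief, non-limiting sketch of this consistency requirement, the basic knowledge graph (assumed here to be held as a networkx graph) can be relabelled with the very same encryption callable that produced the dense-state keywords:

```python
import networkx as nx

def encrypt_knowledge_graph(basic_kg: nx.Graph, encrypt) -> nx.Graph:
    """Relabel every entity with its ciphertext to obtain the preset knowledge graph.

    `encrypt` must be the same callable (same symmetric algorithm, same key) used for
    the keywords, e.g. functools.partial(encrypt_keyword, key=key) from the earlier
    sketch, so that dense-state keywords can still be matched against graph nodes.
    """
    mapping = {entity: encrypt(entity) for entity in basic_kg.nodes}
    return nx.relabel_nodes(basic_kg, mapping)
```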
Step S250: and segmenting the preset knowledge graph according to the dense-state keywords to obtain a dense-state subgraph.
Specifically, step S250 may include:
step S251: according to the secret key words, entities corresponding to the secret key words are matched in the preset knowledge graph to obtain knowledge nodes;
and according to the dense-state key words, entity matching is carried out in a preset knowledge graph, namely, entities corresponding to the dense-state key words are searched in the preset knowledge graph, and the entities are determined as knowledge nodes. When a plurality of dense keywords are available, a plurality of knowledge nodes can be obtained, and the knowledge nodes may not be associated or may be associated in a preset knowledge graph.
In this embodiment, taking one dense-state keyword as an example, entity 1 in the knowledge graph shown in fig. 4 represents the knowledge node corresponding to that dense-state keyword.
Step S252: segmenting the preset knowledge graph with each knowledge node as a center according to a preset cutting distance to obtain the dense-state subgraph, wherein the length unit of the preset cutting distance is one edge between two entities, and the dense-state subgraph is the set of entities and edges within the preset cutting distance of the knowledge node at its center.
Specifically, with the knowledge node as the center, the sub-knowledge graph is cut out according to the preset cutting distance to obtain the dense-state subgraph corresponding to the dense-state keyword, namely the set of entities and edges within the preset cutting distance of that knowledge node; each dense-state keyword thus obtains a corresponding dense-state subgraph. As shown in fig. 4, the length unit of the preset cutting distance is one edge R between two entities, i.e. one association relation between them, and the edges are denoted R1, R2, ..., Rm in sequence. In this embodiment, the knowledge graph of fig. 4 is segmented with entity 1 as the center according to the preset cutting distance, giving the dense-state subgraphs shown in fig. 5. Fig. 5(a) shows the dense-state subgraph obtained when the cutting distance is 1 unit: entity 1 is taken as the center, the entities and edges reachable from entity 1 through no more than one edge are cut out, and the resulting set of entities and edges is the dense-state subgraph of the dense-state keyword represented by entity 1. Fig. 5(b) shows the dense-state subgraph obtained when the cutting distance is 2 units: with entity 1 as the center, the entities and edges reachable from entity 1 through no more than two edges are cut out, and the resulting set is the dense-state subgraph of the dense-state keyword represented by entity 1. Similarly, fig. 5(c) shows the dense-state subgraph obtained when the cutting distance is 3 units.
Using the sub-knowledge graph centered on a dense-state keyword to express that keyword, i.e. jointly representing the dense-state keyword by the entities and relations strongly correlated with it, expresses the semantic concept of the dense-state keyword more comprehensively. Because the semantic concept is characterized precisely through the other related entities and relations, easily confused dense-state keywords become more clearly distinguishable, and the semantic relevance of the multi-modal data encoding is better preserved on the premise of guaranteeing the privacy of user data.
Step S260: performing graph embedding on the dense-state subgraph to obtain the dense-state representation vector corresponding to each dense-state keyword, so as to obtain the semantic representation result of the multi-modal data.
Each dense-state keyword can be represented as a sub-knowledge graph. A graph embedding operation is performed on this sub-knowledge graph to obtain its dense-state representation vector, which is the semantic representation of the dense-state keyword; the graph embedding can be performed with a random-walk algorithm (the DeepWalk algorithm).
In this embodiment, the DeepWalk algorithm is used for the graph embedding operation. The algorithm comprises two parts, generation and updating: the random walk generator (Random Walk Generator) generates random paths analogous to sentences, and the update procedure (Update Procedure) inputs a random path into the Skip-Gram model to obtain the hidden representations of the nodes in the dense-state subgraph. Specifically, the random walk generator samples a node of the dense-state subgraph uniformly at random as the starting point of a random path; it then repeatedly takes the current node, samples the next node uniformly from all neighbors of the current node, and makes that node the new current node, stopping when the set maximum length is reached or when the current node has no neighbors. The visited nodes, connected in order, form the random path produced by the generator; the lengths of the random paths may differ, but none exceeds the set maximum length. In this way, a random path is generated for every node in the dense-state subgraph. After a random path is generated, the update procedure treats it as a sentence, inputs it into the Skip-Gram model, computes the objective function, and updates the hidden representations of the nodes in the dense-state subgraph. The objective function is:
J(Φ) = -log Pr(u_k | v_j; Φ),
where v_j denotes the j-th node on the random path, u_k denotes a node of the dense-state subgraph other than v_j, Pr(u_k | v_j; Φ) denotes the conditional probability that u_k occurs given v_j, and Φ denotes the trainable parameters.
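As a hedged sketch of the generation-and-update procedure just described, the following Python code pairs a uniform random walk generator with the Skip-Gram model of the gensim library (Word2Vec with sg=1); the walk length, window size, and embedding dimension are illustrative assumptions rather than values fixed by this embodiment.

```python
import random
import networkx as nx
from gensim.models import Word2Vec  # Skip-Gram when sg=1

def random_walk(graph: nx.Graph, start, max_length: int) -> list:
    """Random Walk Generator: uniformly sample a path of at most `max_length` nodes."""
    path = [start]
    current = start
    while len(path) < max_length:
        neighbors = list(graph.neighbors(current))
        if not neighbors:                   # stop early if the current node has no neighbors
            break
        current = random.choice(neighbors)  # uniform sampling over all neighbors
        path.append(current)
    return [str(node) for node in path]     # gensim expects string tokens

def deepwalk_embed(dense_subgraph: nx.Graph, max_length: int = 10, dim: int = 64) -> Word2Vec:
    """Update Procedure: feed one random path per node into a Skip-Gram model."""
    walks = [random_walk(dense_subgraph, node, max_length) for node in dense_subgraph.nodes()]
    # Skip-Gram maximizes log Pr(u_k | v_j; Phi) over context nodes, i.e. it minimizes
    # the objective J(Phi) given above.
    return Word2Vec(sentences=walks, vector_size=dim, window=3, sg=1, min_count=1)
```

A dense-state representation vector for the dense-state keyword can then be read off the trained model, for example as the vector of the central knowledge node or as the mean of all node vectors of the dense-state subgraph; this read-out choice is an assumption, as the embodiment does not fix it here.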
According to the method, on the premise of guaranteeing the privacy of the user data and the correlation of the modal data, the privacy semantic representation of the cross-modal data is realized.
For further details of the implementation of steps S210 to S260, reference may be made to the description of the implementation based on steps S110 to S150 in the first embodiment, and for brevity of the description, no further description is given here.
The cross-modal privacy semantic representation method provided by this embodiment can ensure the semantic association among dense-state keywords, and the supported modalities can be dynamically increased or decreased according to business requirements, which guarantees the flexibility and semantic accuracy of subsequent privacy semantic retrieval as well as the retrieval accuracy, and is of great significance for meeting the requirements of user privacy security and cross-modal data retrieval.
EXAMPLE III
Based on the same inventive concept, referring to fig. 6, a first embodiment of the cross-modal privacy semantic representation apparatus according to the present invention is provided, where the cross-modal privacy semantic representation apparatus may be a virtual apparatus and is applied to a cross-modal privacy semantic representation device.
The following describes in detail the cross-modal privacy semantic representation apparatus provided in this embodiment with reference to a functional module schematic diagram shown in fig. 6, where the apparatus may include:
the data acquisition module is used for acquiring multi-modal data;
the text description module is used for obtaining corresponding text data according to the multi-modal data;
the keyword extraction module is used for extracting and encrypting the keywords of the text data to obtain dense-state keywords;
the graph segmentation module is used for segmenting the preset knowledge graph according to the dense-state keywords to obtain dense-state subgraphs;
and the graph embedding module is used for performing graph embedding on the dense-state subgraph to obtain a dense-state representation vector corresponding to the dense-state keywords, so as to obtain a semantic representation result of the multi-modal data.
Further, the multi-modal data comprises data information of at least two different modalities; the text description module may include:
a first data processing unit, configured to convert, when the multi-modal data includes first-modality data of a speech modality, the first-modality data into first text data using a speech recognition technique;
the second data processing unit is used for converting the second modal data into second text data by utilizing a trained text generation model when the multi-modal data comprises the second modal data of a video mode;
a third data processing unit for directly determining the third modality data as third text data when the multi-modality data includes third modality data of a text modality.
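Purely as an illustration of how the three data processing units might be orchestrated, the sketch below routes each modality to a text-producing routine; speech_to_text and video_caption are hypothetical callables standing in for the speech recognition technique and the trained text generation model, which the embodiment does not name concretely.

```python
from typing import Callable

def describe_as_text(modality: str,
                     payload: bytes,
                     speech_to_text: Callable[[bytes], str],
                     video_caption: Callable[[bytes], str]) -> str:
    """Map one item of multi-modal data to its text description."""
    if modality == "speech":
        return speech_to_text(payload)       # first data processing unit
    if modality == "video":
        return video_caption(payload)        # second data processing unit
    if modality == "text":
        return payload.decode("utf-8")       # third data processing unit: text is used directly
    raise ValueError(f"unsupported modality: {modality}")
```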
Still further, the keyword extraction module may include:
and the first keyword extraction unit is connected with the first data processing unit, the second data processing unit and/or the third data processing unit, and is used for extracting and encrypting keywords of the first text data, the second text data and/or the third text data to obtain dense-state keywords.
Further, the keyword extraction module may include:
the keyword extraction submodule is used for extracting keywords from the text data through an unsupervised learning algorithm to obtain keywords;
and the encryption submodule is used for encrypting the keywords through a symmetric encryption algorithm to obtain dense-state keywords.
Still further, the keyword extraction sub-module may include:
the splitting unit is used for performing word segmentation processing on the text data to obtain a plurality of words;
the drawing unit is used for drawing a vocabulary network diagram according to the vocabularies; the network nodes of the vocabulary network graph correspond to the vocabularies, edges connecting the two network nodes have attribute values, and the attribute values are determined according to the co-occurrence relation of the vocabularies;
and the screening unit is used for sequencing and screening the vocabularies according to the vocabulary network diagram to obtain keywords representing the text data.
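The splitting, drawing, and screening units together describe ranking over a co-occurrence word graph; a common unsupervised realization of this pattern is TextRank-style PageRank. The sketch below is one such assumption (jieba is used only as an example word segmentation tool, and the window size and top-k are illustrative), not the claimed implementation.

```python
import jieba          # example Chinese word segmentation tool
import networkx as nx

def extract_keywords(text: str, window: int = 3, top_k: int = 5) -> list:
    # Splitting unit: word segmentation of the text data.
    words = [w for w in jieba.lcut(text) if w.strip()]

    # Drawing unit: vocabulary network graph whose edge attribute value is the
    # co-occurrence count of two words inside a sliding window.
    graph = nx.Graph()
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if word != other:
                weight = graph.get_edge_data(word, other, default={"weight": 0})["weight"] + 1
                graph.add_edge(word, other, weight=weight)

    # Screening unit: rank the vocabulary nodes and keep the top-k as keywords.
    scores = nx.pagerank(graph, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```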
Further, the apparatus may further include:
the preset knowledge graph acquisition module is used for determining a basic knowledge graph through the open source knowledge graph; encrypting the basic knowledge graph to obtain a preset knowledge graph; and the encryption algorithm adopted by the encryption processing is consistent with the encryption algorithm adopted by the encryption of the text data.
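Because the same encryption algorithm is applied to the keywords and to the basic knowledge graph, a dense-state keyword can be matched against an encrypted entity by simple equality, which calls for a deterministic symmetric scheme. The sketch below uses AES in ECB mode with PKCS7 padding from the cryptography package purely as an illustration; the embodiment does not name a specific symmetric encryption algorithm, and ECB is chosen here only because its determinism makes such matching possible.

```python
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_keyword(keyword: str, key: bytes) -> bytes:
    """Deterministically encrypt a keyword: identical plaintexts yield identical ciphertexts."""
    padder = padding.PKCS7(128).padder()
    padded = padder.update(keyword.encode("utf-8")) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    return encryptor.update(padded) + encryptor.finalize()

key = b"\x01" * 32  # 256-bit key shared by keyword encryption and knowledge-graph encryption

# The same routine encrypts both the extracted keyword and the graph entity,
# so matching a dense-state keyword to a knowledge node reduces to ciphertext equality.
assert encrypt_keyword("blood pressure", key) == encrypt_keyword("blood pressure", key)
```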
Further, the graph segmentation module may include:
the matching unit is used for matching entities corresponding to the dense-state keywords in the preset knowledge graph according to the dense-state keywords to obtain knowledge nodes;
and the segmentation unit is used for performing segmentation according to a preset clipping distance by taking the knowledge node as a center in the preset knowledge graph to obtain a dense-state subgraph; the length unit of the preset clipping distance is an edge between two entities, and the dense-state subgraph is the set of entities and edges within the preset clipping distance range with the knowledge node as the center.
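Combining the matching unit and the segmentation unit, a hedged end-to-end sketch might look as follows; it reuses the illustrative clip_dense_subgraph helper from the earlier sketch and assumes the preset knowledge graph is held as a networkx graph whose node labels are the encrypted entities.

```python
import networkx as nx

def dense_subgraphs_for_keywords(dense_keywords: list,
                                 encrypted_graph: nx.Graph,
                                 clip_distance: int) -> dict:
    """Map each dense-state keyword to its dense-state subgraph."""
    subgraphs = {}
    for dense_keyword in dense_keywords:
        # Matching unit: the encrypted entity equal to the dense-state keyword is the knowledge node.
        if encrypted_graph.has_node(dense_keyword):
            # Segmentation unit: clip around the knowledge node at the preset clipping distance.
            subgraphs[dense_keyword] = clip_dense_subgraph(encrypted_graph, dense_keyword, clip_distance)
    return subgraphs
```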
It should be noted that, for the functions that can be realized by each module in the cross-modal privacy semantic representation apparatus provided by this embodiment and the technical effects that can be correspondingly achieved by each module in the cross-modal privacy semantic representation apparatus provided by this embodiment, reference may be made to the description of the specific implementation manner in each embodiment of the cross-modal privacy semantic representation method of the present invention, and for the sake of brevity of the description, details are not described here again.
Example four
Based on the same inventive concept, referring to fig. 2, a schematic diagram of a hardware structure of a cross-modal privacy semantic representation device according to embodiments of the present invention is shown. The present embodiment provides a cross-modal privacy semantic representation device, which may include a processor and a memory, where the memory stores a cross-modal privacy semantic representation program, and when the cross-modal privacy semantic representation program is executed by the processor, all or part of the steps of each embodiment of the cross-modal privacy semantic representation method according to the present invention are implemented.
Specifically, the cross-modal privacy semantic representation device refers to a terminal device or a network device capable of realizing network connection, and may be a terminal device such as a mobile phone, a computer, a tablet computer, a portable computer, or a network device such as a server and a cloud platform.
It will be appreciated that the device may also include a communications bus, a user interface and a network interface.
Wherein the communication bus is used for realizing connection communication among the components.
The user interface is used for connecting the client and communicating data with the client, and may include an output unit, such as a display screen, an input unit, such as a keyboard, and optionally, other input/output interfaces, such as a standard wired interface and a wireless interface.
The network interface is used for connecting the background server and performing data communication with the background server, and the network interface may include an input/output interface, such as a standard wired interface, a wireless interface, such as a Wi-Fi interface.
The memory is used to store various types of data, which may include, for example, instructions for any application or method in the cross-modal privacy semantic characterization device, as well as application-related data. The Memory may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Random Access Memory (RAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic or optical disk, or alternatively, the Memory may be a storage device independent from the processor.
The Processor may be an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and is configured to call the cross-modal privacy semantic representation program stored in the memory and execute the cross-modal privacy semantic representation method.
EXAMPLE five
Based on the same inventive concept, the present embodiment provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application mall, etc., wherein the storage medium stores thereon a computer program, the computer program is executable by one or more processors, and when the computer program is executed by the processors, the computer program can implement all or part of the steps of the embodiments of the cross-modal privacy semantic characterization method according to the present invention.
It should be noted that the above-mentioned serial numbers of the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
The above description is only an alternative embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications, equivalents and flow changes made by the present invention as described in the specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A cross-modal privacy semantic characterization method, the method comprising:
obtaining multi-modal data;
obtaining corresponding text data according to the multi-modal data;
extracting and encrypting the keywords of the text data to obtain dense-state keywords;
according to the dense-state keywords, segmenting a preset knowledge graph to obtain a dense-state subgraph;
and carrying out graph embedding on the dense-state subgraph to obtain a dense-state representation vector corresponding to the dense-state keywords, so as to obtain a semantic representation result of the multi-modal data.
2. The cross-modal privacy semantic representation method of claim 1 wherein the multi-modal data comprises data information of at least two different modalities;
the step of obtaining corresponding text data from the multimodal data comprises:
when the multi-modal data comprises first modal data of a voice modality, converting the first modal data into first text data by utilizing a voice recognition technology;
when the multi-modal data comprise second modal data of a video modality, converting the second modal data into second text data by utilizing a trained text generation model;
when the multi-modal data includes third modal data of a text modality, directly determining the third modal data as third text data.
3. The cross-modal privacy semantic characterization method of claim 2, wherein the step of performing keyword extraction and encryption on the text data to obtain dense-state keywords comprises:
and extracting and encrypting the keywords of the first text data, the second text data and/or the third text data to obtain dense-state keywords.
4. The cross-modal privacy semantic representation method of claim 1, wherein the step of performing keyword extraction and encryption on the text data to obtain dense-state keywords comprises:
extracting keywords from the text data through an unsupervised learning algorithm to obtain keywords;
and encrypting the keywords through a symmetric encryption algorithm to obtain dense-state keywords.
5. The cross-modal privacy semantic representation method of claim 4, wherein the step of extracting keywords from the text data by an unsupervised learning algorithm to obtain keywords comprises:
performing word segmentation processing on the text data to obtain a plurality of words;
drawing a vocabulary network diagram according to the vocabularies; the network nodes of the vocabulary network graph correspond to the vocabularies, edges connecting the two network nodes have attribute values, and the attribute values are determined according to the co-occurrence relation of the vocabularies;
and sequencing and screening the vocabularies according to the vocabulary network diagram to obtain keywords representing the text data.
6. The cross-modal privacy semantic representation method according to claim 1, wherein before the step of segmenting the preset knowledge graph according to the dense-state keywords to obtain a dense-state subgraph, the method further comprises:
determining a basic knowledge graph through the open source knowledge graph;
encrypting the basic knowledge graph to obtain a preset knowledge graph; and the encryption algorithm adopted by the encryption processing is consistent with the encryption algorithm adopted by the encryption of the text data.
7. The cross-modal privacy semantic representation method according to claim 1, wherein the step of segmenting the preset knowledge graph according to the dense-state keywords to obtain a dense-state subgraph comprises:
according to the dense-state keywords, entities corresponding to the dense-state keywords are matched in the preset knowledge graph to obtain knowledge nodes;
in the preset knowledge graph, the knowledge nodes are used as centers, and segmentation is carried out according to a preset clipping distance to obtain dense-state subgraphs; the length unit of the preset clipping distance is an edge between two entities, and the dense-state subgraph is the set of entities and edges within the preset clipping distance range with the knowledge node as the center.
8. A cross-modal privacy semantic characterization apparatus, the apparatus comprising:
the data acquisition module is used for acquiring multi-modal data;
the text description module is used for obtaining corresponding text data according to the multi-modal data;
the keyword extraction module is used for extracting and encrypting the keywords of the text data to obtain dense-state keywords;
the graph segmentation module is used for segmenting a preset knowledge graph according to the dense-state keywords to obtain dense-state subgraphs;
and the graph embedding module is used for performing graph embedding on the dense-state subgraph to obtain a dense-state representation vector corresponding to the dense-state keywords, so as to obtain a semantic representation result of the multi-modal data.
9. A cross-modal privacy semantic representation device comprising a memory and a processor, the memory having stored thereon a cross-modal privacy semantic representation program that, when executed by the processor, implements the cross-modal privacy semantic representation method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program, the computer program being executable by one or more processors to implement a cross-modal privacy semantic characterization method according to any one of claims 1 to 7.
CN202210089691.1A 2022-01-25 2022-01-25 Cross-modal privacy semantic representation method, device, equipment and storage medium Pending CN114528588A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210089691.1A CN114528588A (en) 2022-01-25 2022-01-25 Cross-modal privacy semantic representation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210089691.1A CN114528588A (en) 2022-01-25 2022-01-25 Cross-modal privacy semantic representation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114528588A true CN114528588A (en) 2022-05-24

Family

ID=81623828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210089691.1A Pending CN114528588A (en) 2022-01-25 2022-01-25 Cross-modal privacy semantic representation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114528588A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115757915A (en) * 2023-01-09 2023-03-07 佰聆数据股份有限公司 Electronic file online generation method and device
CN117113385A (en) * 2023-10-25 2023-11-24 成都乐超人科技有限公司 Data extraction method and system applied to user information encryption
CN117113385B (en) * 2023-10-25 2024-03-01 成都乐超人科技有限公司 Data extraction method and system applied to user information encryption
CN117195913A (en) * 2023-11-08 2023-12-08 腾讯科技(深圳)有限公司 Text processing method, text processing device, electronic equipment, storage medium and program product
CN117195913B (en) * 2023-11-08 2024-02-27 腾讯科技(深圳)有限公司 Text processing method, text processing device, electronic equipment, storage medium and program product

Similar Documents

Publication Publication Date Title
US11227118B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
US10726204B2 (en) Training data expansion for natural language classification
CN109840321B (en) Text recommendation method and device and electronic equipment
CN111241237B (en) Intelligent question-answer data processing method and device based on operation and maintenance service
CN114528588A (en) Cross-modal privacy semantic representation method, device, equipment and storage medium
CN112084337A (en) Training method of text classification model, and text classification method and equipment
US10169466B2 (en) Persona-based conversation
WO2018045646A1 (en) Artificial intelligence-based method and device for human-machine interaction
CN111783903B (en) Text processing method, text model processing method and device and computer equipment
CN111539197A (en) Text matching method and device, computer system and readable storage medium
CN110619051A (en) Question and sentence classification method and device, electronic equipment and storage medium
CN111538818B (en) Data query method, device, electronic equipment and storage medium
CN112287069A (en) Information retrieval method and device based on voice semantics and computer equipment
CN110019948B (en) Method and apparatus for outputting information
CN109670033A (en) Search method, device, equipment and the storage medium of content
US10102289B2 (en) Ingesting forum content
US20190034410A1 (en) Unsupervised Template Extraction
CN117093687A (en) Question answering method and device, electronic equipment and storage medium
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence
CN110413770B (en) Method and device for classifying group messages into group topics
CN117236340A (en) Question answering method, device, equipment and medium
CN114416990B (en) Method and device for constructing object relation network and electronic equipment
CN111597453B (en) User image drawing method, device, computer equipment and computer readable storage medium
CN114579876A (en) False information detection method, device, equipment and medium
US10803115B2 (en) Image-based domain name system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination