CN115048539A

CN115048539A - Social media data online retrieval method and system based on dynamic memory

Info

Publication number: CN115048539A
Application number: CN202210971339.0A
Authority: CN
Inventors: 罗昕; 王娜; 丁陈璐; 许信顺
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2022-08-15
Filing date: 2022-08-15
Publication date: 2022-09-13
Anticipated expiration: 2042-08-15
Also published as: CN115048539B

Abstract

The invention provides a dynamic-memory-based method and system for online retrieval of social media data, in the technical field of large-scale streaming data retrieval. The method comprises the following steps: acquiring sample data of multiple rounds and the corresponding user tags; starting from the first round, learning a hash function on the sample data of each round in sequence to obtain hash codes of the sample data, and storing the hash codes in a database; receiving social media data to be retrieved, mapping it with the optimized hash function to obtain its hash code, and comparing that hash code with the hash codes of the sample data in the database to obtain a retrieval result. The method suits online scenarios: pairwise similarity matrices between the labels of new and old data across rounds guide the generation of refined pseudo labels, and the hash loss function is determined from the refined pseudo labels, which mitigates the negative influence of user labels and improves the quality of the generated hash codes.

Description

Social media data online retrieval method and system based on dynamic memory

Technical Field

The invention belongs to the technical field of large-scale stream data retrieval, and particularly relates to a social media data online retrieval method and system based on dynamic memory.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art that is already known to a person of ordinary skill in the art.

Over the past decades, social media data such as images, text and video has grown explosively, and so has the demand for retrieving it. Hash learning has become a popular approximate nearest neighbor technique thanks to its fast retrieval and low storage cost: it maps high-dimensional data to binary codes while preserving the similarity of the data in the original space. Retrieval is fast because computers compare pairs of binary codes very efficiently.

Currently, hash learning can be divided into supervised, weakly supervised and unsupervised learning, in which the generation of hash codes is guided by expert-annotated labels, user-provided labels and unsupervised information respectively. Weakly supervised hash learning has attracted increasing attention because user-provided labels are easy to obtain, are diverse, and can provide information beyond visual features. However, compared with the clean labels annotated by experts, user-provided labels are imperfect: they may be wrong, duplicated or missing, which can degrade the performance of the retrieval model. To mitigate these negative effects, some methods exploit, for example, the semantic information of the labels. Although these approaches perform well, most of them are batch-based: their memory and computation costs grow as streaming data arrives, and they run against the streaming nature of social media data, which is generated and collected continuously. Although some online weakly supervised hashing methods for streaming data have improved markedly in recent years, they still cannot overcome the problems of missing labels and of catastrophic forgetting in online scenarios.

Disclosure of Invention

In order to solve the problems, the invention provides a social media data online retrieval method and system based on dynamic memory, which utilize pairwise similarity matrixes between new and old data labels in sample data of different rounds to construct refined pseudo labels, and determine a hash loss function according to the refined pseudo labels, so as to relieve the negative effects of user labels and improve the quality of generated hash codes.

In order to achieve the above object, the present invention mainly includes the following aspects:

in a first aspect, an embodiment of the present invention provides a method for social media data online retrieval based on dynamic memory, including:

acquiring sample data of multiple rounds and the corresponding user tags;

starting from the first round, carrying out hash function learning on sample data of each round in sequence to obtain a hash code of the sample data, and storing the hash code in a database; aiming at the sample data of the t-th round, constructing refined pseudo labels of the sample data of the t-th round according to pairwise similarity matrixes between the sample data of the t-th round and user labels corresponding to the sample data before the t-th round; determining a hash loss function according to the constructed refined pseudo label, optimizing relevant parameters of the hash function by minimizing the hash loss function, and obtaining a hash code of the sample data of the t round;

and receiving social media data to be retrieved, mapping according to the optimized hash function to obtain a corresponding hash code, and comparing the hash code of the social media data with the hash code of sample data in a database to obtain a retrieval result.

In one possible embodiment, the sample data includes text data, image data, and video data; after sample data of multiple rounds and corresponding user labels are obtained, before hash function learning is sequentially carried out on each round training sample, the method further comprises the following steps: and extracting the characteristics of the sample data, and carrying out one-hot coding on the user label to obtain a label representation.

In a possible implementation manner, a label matrix is determined according to the label representations corresponding to the sample data of the t-th round and to the sample data before the t-th round; the label matrix is multiplied by its transpose to obtain the pairwise similarity matrix of the labels; and the pairwise similarity matrix is normalized to obtain the refined pseudo labels of the t-th round.

In one possible implementation, the method for determining the hash loss function includes:

determining a paired similarity matrix of the sample data in the t-th round according to a paradigm of Hash learning and the constructed refined pseudo labels, and constructing a first objective function for learning Hash codes of the sample data in the t-th round;

constructing a second objective function for learning the hash code of the sample data of the t-th round according to the pairwise similarity matrix between the representative point of the sample data of the t-th round in the memory and the sample data;

capturing the nonlinear characteristics of the sample data, and performing hash function learning by using linear regression to obtain a third target function for learning the hash code of the sample data of the t-th round;

and integrating the first objective function, the second objective function and the third objective function into a Hash loss function to obtain a final Hash loss function.

In a possible implementation manner, for each sample point the distance between its refined pseudo label and its user label is calculated, the distances are sorted in ascending order, and a preset number of the top-ranked sample points are selected as representative points. During hash learning the memory holds a fixed number of representative points, and the representative points newly selected in each round replace a preset part of the representative points in the memory.

In one possible implementation, the sample data is processed by using the radial basis kernel function, and the nonlinear characteristics of the sample data are captured.

In a possible embodiment, the hash loss function is minimized by iterative optimization, specifically: in each iteration only the designated target variable is optimized, and all other variables in the hash loss function are kept unchanged; the partial derivative of the hash loss function with respect to the target variable is set to zero and solved to obtain the optimized target variable.

In a possible embodiment, the obtaining a search result by comparing the hash code of the social media data with the hash code of the sample data in the database includes: calculating the Hamming distance between the hash code of the social media data and the hash code of the sample data in the database, and outputting the sample data with preset quantity according to the Hamming distance.

In a second aspect, an embodiment of the present invention provides a social media data online retrieval system based on dynamic memory, including:

the data acquisition module is used for acquiring sample data of multiple rounds and corresponding user tags;

the hash function learning module is used for sequentially carrying out hash function learning on sample data of each round from the first round to obtain a hash code of the sample data, and storing the hash code into the database; aiming at the sample data of the t-th round, constructing refined pseudo labels of the sample data of the t-th round according to pairwise similarity matrixes between the sample data of the t-th round and user labels corresponding to the sample data before the t-th round; determining a hash loss function according to the constructed refined pseudo label, optimizing relevant parameters of the hash function by minimizing the hash loss function, and obtaining a hash code of the sample data of the t-th round;

and the retrieval module is used for receiving the social media data to be retrieved, mapping according to the optimized hash function to obtain a corresponding hash code, and comparing the hash code of the social media data with the hash code of the sample data in the database to obtain a retrieval result.

In one possible implementation, the system further includes:

a preprocessing module, used for extracting the features of the sample data and one-hot encoding the user labels to obtain a label representation.

The above one or more technical solutions have the following beneficial effects:

(1) according to the method, paired similarity matrixes (namely label co-occurrence relation) between new and old data labels in sample data of different rounds are used for guiding generation of refined pseudo labels, and a Hash loss function is determined according to the refined pseudo labels, so that the negative influence of user labels can be relieved, and the quality of the generated Hash codes is improved.

(2) The invention provides a memory-based similarity learning strategy, samples with refined pseudo labels closest to original user labels are selected from old data and taken as representative points and stored in a memory, so that semantic relevance between new data and old data is maintained, and the problem of catastrophic forgetting of an online scene is effectively solved.

(3) The invention provides a method for minimizing the Hash loss function by adopting an iterative optimization mode, which can ensure that the learning efficiency meets the requirement of an online scene.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention. They illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.

FIG. 1 is a flowchart illustrating a social media data online retrieval method based on dynamic memory according to an embodiment of the present invention;

FIG. 2 is a block diagram of a social media data online retrieval method based on dynamic memory according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a social media data online retrieval system based on dynamic memory according to a second embodiment of the present invention.

Detailed Description

The invention is further described with reference to the following figures and examples.

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

In order to solve the problems of missing labels and of catastrophic forgetting in online scenarios that affect existing online weakly supervised hashing methods, the invention provides a social media data online retrieval method and system based on dynamic memory, which focus on the following three aspects: 1) for social media data such as images, text and video, how to make full use of the value of user tags while reducing their negative influence, so as to improve the quality of hash learning; 2) how to solve the catastrophic forgetting of streaming data in an online scenario, so that the model further improves its online learning quality; 3) how to make the time efficiency of the method meet the requirements of an online scenario, i.e. keep the time complexity as low as possible, so that the method can be extended to large-scale datasets.

Example one

The embodiment provides an online social media data retrieval method based on dynamic memory, as shown in fig. 1, including the following steps:

S101: acquiring sample data of multiple rounds and the corresponding user tags;

S102: starting from the first round, carrying out hash function learning on the sample data of each round in sequence to obtain hash codes of the sample data, and storing the hash codes in a database; for the sample data of the t-th round, constructing refined pseudo labels of the sample data of the t-th round according to the pairwise similarity matrix between the user labels corresponding to the sample data of the t-th round and to the sample data before the t-th round; determining a hash loss function according to the constructed refined pseudo labels, optimizing the parameters of the hash function by minimizing the hash loss function, and obtaining the hash codes of the sample data of the t-th round;

S103: receiving social media data to be retrieved, mapping it with the optimized hash function to obtain its hash code, and comparing that hash code with the hash codes of the sample data in the database to obtain a retrieval result.

As an optional implementation, the sample data includes text data, image data, and video data; after the sample data of multiple rounds and the corresponding user labels are obtained, and before hash function learning is carried out on the training samples of each round in sequence, the method further comprises: extracting the features of the sample data, and one-hot encoding the user labels to obtain a label representation.

In a specific implementation, the sample data includes text data, image data and video data, and features are extracted separately for each type of data. Taking image data as an example, a VGG-F deep network is used for image feature extraction, and the 4096-dimensional output of the fully connected layer fc7 is taken as the visual feature X of an image. The user labels are one-hot encoded to obtain the label matrix Y.
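The label encoding step can be illustrated with a minimal sketch. It assumes that each sample may carry several user tags, so each column of Y is a binary indicator vector over a fixed tag vocabulary, and that Y stores one column per sample; the function name `encode_tags`, the vocabulary and the sample tags are hypothetical and do not come from the patent.

```python
import numpy as np

def encode_tags(samples_tags, vocabulary):
    """Return a c x n binary label matrix Y (one row per tag, one column per sample)."""
    index = {tag: i for i, tag in enumerate(vocabulary)}
    Y = np.zeros((len(vocabulary), len(samples_tags)), dtype=np.float64)
    for j, tags in enumerate(samples_tags):
        for tag in tags:
            if tag in index:  # tags outside the vocabulary are ignored (assumption)
                Y[index[tag], j] = 1.0
    return Y

# Hypothetical tag vocabulary and three tagged samples.
vocabulary = ["beach", "dog", "sunset"]
samples_tags = [["dog", "beach"], ["sunset"], ["beach", "sunset", "sea"]]
Y = encode_tags(samples_tags, vocabulary)
```

Ignoring out-of-vocabulary tags is only one possible policy; a growing vocabulary would need extra rows in Y.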

As an optional implementation manner, a label matrix is determined according to the label representations corresponding to the sample data of the t-th round and to the sample data before the t-th round; the label matrix is multiplied by its transpose to obtain the pairwise similarity matrix of the labels; and the pairwise similarity matrix is normalized to obtain the refined pseudo labels of the t-th round.

In a specific implementation, the pairwise similarity matrix A of the labels is constructed by multiplying the label matrix Y by its transpose. A is also the co-occurrence matrix of the tags: the more often two tags appear together, the higher their similarity.

Note that A is not constant, because new data keeps arriving and the overall similarity between labels may change accordingly. The label matrices of both the old and the new data are therefore considered. Let $\tilde{Y}^{(t-1)}$ denote the one-hot label matrix of all data before the t-th round and $Y^{(t)}$ the one-hot label matrix of the current t-th round (one column per sample), so that the label matrix up to round t can be written as $\tilde{Y}^{(t)} = [\tilde{Y}^{(t-1)}, Y^{(t)}]$. The label similarity matrix at round t is then

$$A^{(t)} = \tilde{Y}^{(t)} \tilde{Y}^{(t)\top} = \tilde{Y}^{(t-1)} \tilde{Y}^{(t-1)\top} + Y^{(t)} Y^{(t)\top} = A^{(t-1)} + Y^{(t)} Y^{(t)\top}$$

Therefore each round of updating only needs to compute the second term, while the first term was already obtained in the previous round. For convenience of calculation, $A^{(t)}$ is normalized so that its entries lie in the interval [0, 1]. The refined pseudo label matrix of the t-th round is defined through an objective that involves the normalized similarity matrix and a balance parameter [formula not reproduced]. The refined pseudo label matrix is real-valued; setting the partial derivative of this objective with respect to it to zero yields a closed-form solution [formula not reproduced].
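The incremental update of the label co-occurrence matrix can be checked numerically. The sketch below assumes label matrices with one column per sample, so that the co-occurrence matrix is the product of the label matrix with its transpose, and it assumes a simple max-scaling to map the entries to [0, 1]; the normalization actually used in the patent is not recoverable from this text. The function names are hypothetical.

```python
import numpy as np

def update_cooccurrence(A_prev, Y_new):
    """Add the contribution of the new round: A_t = A_{t-1} + Y_new @ Y_new.T."""
    return A_prev + Y_new @ Y_new.T

def normalize_01(A):
    """Scale entries to [0, 1] by the largest entry (an assumed normalization)."""
    m = A.max()
    return A / m if m > 0 else A.copy()

rng = np.random.default_rng(0)
c = 5  # number of distinct tags
# Three rounds of binary c x n_t label matrices with random tags.
rounds = [(rng.random((c, n)) < 0.3).astype(float) for n in (8, 6, 10)]

A = np.zeros((c, c))
for Y_t in rounds:
    A = update_cooccurrence(A, Y_t)  # only the new round is touched

# Batch computation over all rounds, for comparison.
Y_all = np.hstack(rounds)
A_batch = Y_all @ Y_all.T
A_norm = normalize_01(A)
```

The per-round cost depends only on the size of the new round, which is the point of keeping the previous matrix.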

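As an optional implementation, the method for determining the hash loss function includes:

determining the pairwise similarity matrix of the sample data of the t-th round according to the paradigm of hash learning and the constructed refined pseudo labels, and constructing a first objective function for learning the hash codes of the sample data of the t-th round;

constructing a second objective function for learning the hash codes of the sample data of the t-th round according to the pairwise similarity between the representative points in the memory and the sample data of the t-th round;

capturing the nonlinear characteristics of the sample data, and performing hash function learning by linear regression to obtain a third objective function for learning the hash codes of the sample data of the t-th round;

and integrating the first, second and third objective functions to obtain the final hash loss function.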

In a specific implementation, as shown in FIG. 2, the online hash learning stage mainly includes the following steps:

① Learning hash codes based on the similarity of the refined pseudo labels.

Following the paradigm of hash learning, the pairwise similarity matrix S_nn of the new data at the t-th round is constructed from the refined pseudo label matrix: the j-th column of the refined pseudo label matrix is divided by the norm of that column vector, and the similarity is computed from the normalized columns [formula not reproduced]. The first objective function for learning the hash codes of the sample data of the t-th round can then be written in terms of S_nn and the hash codes of the current t-th round [formula not reproduced], where a hyperparameter balances the term and the matrix norm measures the fitting error.
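As an illustration of building a pairwise similarity matrix from column-normalized pseudo labels, the following sketch computes the cosine similarity between the columns of a real-valued label matrix. This is an assumption: the exact construction of S_nn is not recoverable from this text and may include additional scaling or shifting. The function name and the example matrix are hypothetical.

```python
import numpy as np

def pairwise_similarity(L, eps=1e-12):
    """Cosine similarity between the columns of a c x n pseudo label matrix."""
    norms = np.linalg.norm(L, axis=0, keepdims=True)  # one norm per column
    L_hat = L / np.maximum(norms, eps)                # guard against all-zero columns
    return L_hat.T @ L_hat                            # n x n similarity matrix

# Three samples over two tags; the first two samples share the same direction.
L = np.array([[1.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])
S = pairwise_similarity(L)
```

Samples whose pseudo labels point in the same direction get similarity 1 regardless of magnitude, and samples with disjoint tags get 0.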

② Learning based on the similarity of the memory.

To address catastrophic forgetting, this embodiment proposes a new strategy, memory-based similarity learning. The sample points whose refined pseudo labels are closest to their original user labels are taken as representative points. Specifically, for each sample point the distance between its refined pseudo label and its user label is calculated, the distances are sorted in ascending order, and a preset number of the top-ranked sample points are selected as representative points. During hash learning the memory holds a fixed number n_q of representative points, and its content is updated in every round: the representative points newly selected in each round replace part of the representative points in the memory.

Specifically, when the first round of data arrives, no old data exists, so only the current data block is used to guide hash learning and the memory of the first round is empty. For the other rounds, without loss of generality, the representative points for the t-th round are selected as follows: after the (t-1)-th round of training finishes, the memory is composed of n_1 points randomly selected from [symbol not reproduced] and n_2 representative points selected from the (t-1)-th round, where n_1 and n_2 are hyperparameters and n_1 + n_2 = n_q. Because the memory is updated after every round of training, the similarity between the new data and the old data is always maintained.
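The memory update can be sketched as follows. The sketch assumes Euclidean distance between each sample's refined pseudo label and its user label, and it assumes that the n_1 randomly selected points are drawn from the previous memory; the text does not make the source of the random points legible, so this is a guess. All function names and the example values are hypothetical.

```python
import numpy as np

def select_representatives(L_refined, Y_user, n2):
    """Indices of the n2 samples whose refined pseudo label is closest to the user label."""
    dist = np.linalg.norm(L_refined - Y_user, axis=0)  # one distance per sample (column)
    return np.argsort(dist, kind="stable")[:n2]

def update_memory(memory_ids, round_ids, L_refined, Y_user, n1, n2, rng):
    """Keep n1 random old entries (assumed: from the previous memory) and add n2 new ones."""
    chosen = select_representatives(L_refined, Y_user, n2)
    new_ids = [round_ids[i] for i in chosen]
    if len(memory_ids) > n1:
        keep = rng.choice(len(memory_ids), size=n1, replace=False)
        kept = [memory_ids[i] for i in sorted(keep)]
    else:
        kept = list(memory_ids)  # e.g. the empty memory of the first round
    return kept + new_ids

rng = np.random.default_rng(0)
Y_user = np.array([[1.0, 0.0, 1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0, 1.0, 1.0]])
# Refined labels deviate from the user labels by a known amount per sample.
offsets = np.array([0.5, 0.1, 0.4, 0.05, 0.3])
L_refined = Y_user + offsets
round_ids = [10, 11, 12, 13, 14]
memory = update_memory([100, 101, 102, 103], round_ids, L_refined, Y_user,
                       n1=2, n2=2, rng=rng)
```

The memory size stays at n_1 + n_2 from the second round on, so storage does not grow with the stream.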

When hash learning is performed on a new round of data, the information stored in the memory is first retrieved and used to maintain the semantic association between the new data and the old data. Specifically, a pairwise similarity matrix between the data points in the memory and the new data of the t-th round is defined [formula not reproduced]; it is computed using the normalized refined pseudo label matrix of the representative points selected from the old data. The second objective function, which corresponds to the similarity between the representative points in the memory and the new sample points, can then be expressed as [formula not reproduced], where a balance hyperparameter weights the term and the hash codes of the representative points are known information stored in the memory.

③ Learning the hash function.

The sample data is processed with a radial basis function (RBF) kernel to capture its nonlinear characteristics. Specifically, the visual feature X is mapped by the kernel function [formula not reproduced], which is defined by m anchor points randomly selected from the first round of training data and by a kernel width.
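A minimal sketch of RBF kernel features with randomly selected anchors is given below. It assumes the common Gaussian form exp(-||x - a||^2 / (2 sigma^2)) and feature matrices with one column per sample; the exact kernel expression used in the patent is not recoverable from this text, and the function name, dimensions and sigma value are hypothetical.

```python
import numpy as np

def rbf_features(X, anchors, sigma):
    """Map d x n features to m x n kernel features using m anchor columns (d x m)."""
    # Squared Euclidean distance between every anchor and every sample.
    sq = (np.sum(anchors ** 2, axis=0)[:, None]
          + np.sum(X ** 2, axis=0)[None, :]
          - 2.0 * anchors.T @ X)
    sq = np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 20))                       # 20 samples with 16-dim features
anchor_idx = rng.choice(X.shape[1], size=5, replace=False)
anchors = X[:, anchor_idx]                          # anchors drawn from the first round
Phi = rbf_features(X, anchors, sigma=4.0)
```

Fixing the anchors after the first round keeps the feature dimension m constant across rounds, which a linear hash function on top of the kernel features requires.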

Classical linear regression is used for hash function learning, and the associated third objective function can be written as [formula not reproduced], where μ is a balance hyperparameter, a second hyperparameter prevents over-fitting, and the learned hash function is used to generate hash codes for test samples.

The objective functions of the three parts are integrated to obtain the final hash loss function [formula not reproduced], in which a real-valued matrix approximates the discrete hash codes; introducing this real-valued matrix simplifies solving for the hash codes. In addition, the decorrelation constraint and the bit-balance constraint [constraint formulas not reproduced] make the hash codes more discriminative.

As an optional implementation manner, the hash loss function is minimized by iterative optimization, specifically: in each iteration only the designated target variable is optimized, and all other variables in the hash loss function are kept unchanged; the partial derivative of the hash loss function with respect to the target variable is set to zero and solved to obtain the optimized target variable.

The specific optimization strategy is as follows:

Step 1: fix the other variables and update the first target variable. Setting the partial derivative of the objective function with respect to this variable to zero gives its update rule [formula not reproduced], which is expressed through two intermediate variables C_1 and C_2 [definitions not reproduced]. Each intermediate variable splits into a term that contains only old data and a term that contains only new data. By storing C_1 and C_2, each round of updating computes only the terms containing new data and does not recompute the terms containing old data, which speeds up learning.
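The idea of caching intermediate variables so that old data is never revisited can be illustrated with online ridge regression. This is a sketch under assumptions: it takes the variable being updated to be a linear projection W fitted by regularized least squares from kernel features to hash codes, and it takes the two accumulators to be the feature Gram matrix and the feature-code cross product. The actual definitions of C_1 and C_2 are not recoverable from this text, and the class name is hypothetical.

```python
import numpy as np

class OnlineRidgeHash:
    """Ridge regression from m-dim features to r-bit codes with cached statistics."""

    def __init__(self, m, r, reg):
        self.C1 = np.zeros((m, m))  # running sum of Phi_t @ Phi_t.T (assumed meaning)
        self.C2 = np.zeros((m, r))  # running sum of Phi_t @ B_t.T (assumed meaning)
        self.reg = reg
        self.W = np.zeros((m, r))

    def update(self, Phi_t, B_t):
        """Fold in one round (Phi_t: m x n_t, B_t: r x n_t) and re-solve for W."""
        self.C1 += Phi_t @ Phi_t.T
        self.C2 += Phi_t @ B_t.T
        m = self.C1.shape[0]
        self.W = np.linalg.solve(self.C1 + self.reg * np.eye(m), self.C2)
        return self.W

rng = np.random.default_rng(0)
m, r, reg = 6, 4, 0.1
model = OnlineRidgeHash(m, r, reg)
Phis, Bs = [], []
for n_t in (30, 20, 25):
    Phi_t = rng.normal(size=(m, n_t))
    B_t = np.where(rng.normal(size=(r, n_t)) >= 0, 1.0, -1.0)
    Phis.append(Phi_t)
    Bs.append(B_t)
    W_online = model.update(Phi_t, B_t)

# Batch solution over all rounds, for comparison.
Phi_all, B_all = np.hstack(Phis), np.hstack(Bs)
W_batch = np.linalg.solve(Phi_all @ Phi_all.T + reg * np.eye(m), Phi_all @ B_all.T)
```

The online and batch solutions coincide, while the per-round cost of the online update depends only on the new round and on the fixed sizes m and r.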

Step 2: fix the other variables and update the second target variable. With the other variables fixed, the objective function can be rewritten [formula not reproduced]. Expanding the Frobenius norm simplifies the optimization [formula not reproduced]. An intermediate term is rewritten to reduce the time complexity [formula not reproduced], which yields a closed-form solution [formula not reproduced].

Step 3: fix the other variables and update the third target variable. With the other variables fixed, its optimal solution can be written as [formula not reproduced], in which an intermediate term is again rewritten to reduce the time complexity [formula not reproduced].

The resulting optimization problem can be solved as follows. First, eigenvalue decomposition is performed on the associated matrix [formula not reproduced]. The decomposition yields a diagonal matrix containing the square roots of the non-zero eigenvalues, together with the eigenvectors corresponding to the non-zero eigenvalues and those corresponding to the zero eigenvalues. A second factor is then computed from these quantities, restricted to the non-zero eigenvalues [formula not reproduced]. The complementary factor is initially set to a random matrix and then subjected to Gram-Schmidt orthogonalization. Finally, the solution is obtained [formula not reproduced], scaled by the square root of the number of samples of the current t-th round.
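A problem of this shape is commonly solved in closed form when a real-valued code matrix V must satisfy a decorrelation constraint and a bit-balance constraint. The sketch below assumes that the subproblem is to maximize the trace of V Z^T subject to V V^T = n I and V 1 = 0 for some r x n matrix Z; this form is an assumption, since the formulas are not recoverable from this text. It uses a singular value decomposition of the centered Z instead of an eigenvalue decomposition, and it covers only the full-rank case, so the random completion with Gram-Schmidt orthogonalization is not needed. The function name is hypothetical.

```python
import numpy as np

def solve_balanced_codes(Z):
    """Maximize trace(V @ Z.T) subject to V @ V.T = n * I and V @ 1 = 0 (full-rank case)."""
    r, n = Z.shape
    Zc = Z - Z.mean(axis=1, keepdims=True)  # equals Z @ (I - 11^T / n)
    P, s, Qt = np.linalg.svd(Zc, full_matrices=False)
    if s.min() <= 1e-10 * s.max():
        # Zero singular values would require completing the factors with
        # orthogonalized random vectors, which this sketch does not cover.
        raise ValueError("rank-deficient input is not handled by this sketch")
    return np.sqrt(n) * P @ Qt

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 50))
V = solve_balanced_codes(Z)
B = np.where(V >= 0, 1, -1)  # binary codes obtained from the real-valued V
```

Because the centered matrix has rows that sum to zero, the right singular vectors are orthogonal to the all-ones vector, which is what makes V satisfy the bit-balance constraint.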

In the retrieval process at the t-th round, when social media data to be retrieved arrives, its hash code is computed with the learned hash function [formula not reproduced]. The Hamming distance between this hash code and the hash codes of the sample data in the database is then calculated to measure their similarity, and a preset number of sample data are returned according to the Hamming distance. For example, the sample data in the database are sorted by Hamming distance, and the preset number of items with the smallest distances are returned as required.
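The retrieval step can be sketched as follows. The sketch assumes that the query hash code is the sign of a linear projection of the query's kernel features, which is a common choice but is not confirmed by this text, and it stores codes as +1/-1 values with one column per database item. The function names and the example data are hypothetical.

```python
import numpy as np

def encode_query(phi_x, W):
    """Hash one m-dim feature vector: sign of the linear projection, in {-1, +1}."""
    return np.where(W.T @ phi_x >= 0, 1, -1)

def hamming_rank(query_code, db_codes, top_k):
    """Return the indices and Hamming distances of the top_k closest database codes."""
    r = db_codes.shape[0]
    # For +1/-1 codes, Hamming distance = (r - inner product) / 2.
    dist = (r - db_codes.T @ query_code) // 2
    order = np.argsort(dist, kind="stable")[:top_k]
    return order, dist[order]

W = np.eye(4)                                   # hypothetical projection
phi_x = np.array([0.2, 0.5, 0.1, 0.9])          # hypothetical query features
query = encode_query(phi_x, W)                  # all bits are +1 here

# Three database items with 4-bit codes, one column per item.
db_codes = np.array([[ 1,  1, -1],
                     [ 1, -1, -1],
                     [ 1,  1,  1],
                     [-1,  1,  1]])
order, dists = hamming_rank(query, db_codes, top_k=2)
```

In a production system the codes would typically be packed into machine words and compared with XOR and popcount; the inner-product form is used here only for readability.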

Example two

The embodiment of the invention also provides a social media data online retrieval system based on dynamic memory, which comprises:

the data acquisition module, used for acquiring sample data of multiple rounds and the corresponding user tags;

the hash function learning module, used for sequentially carrying out hash function learning on the sample data of each round from the first round to obtain hash codes of the sample data, and storing the hash codes in the database; for the sample data of the t-th round, constructing refined pseudo labels of the sample data of the t-th round according to the pairwise similarity matrix between the user labels corresponding to the sample data of the t-th round and to the sample data before the t-th round; determining a hash loss function according to the constructed refined pseudo labels, optimizing the parameters of the hash function by minimizing the hash loss function, and obtaining the hash codes of the sample data of the t-th round;

and the retrieval module, used for receiving the social media data to be retrieved, mapping it with the optimized hash function to obtain its hash code, and comparing that hash code with the hash codes of the sample data in the database to obtain a retrieval result.

The social media data online retrieval system based on dynamic memory provided in this embodiment is used to implement the social media data online retrieval method based on dynamic memory, so the specific implementation manner of the social media data online retrieval system based on dynamic memory can be found in the foregoing embodiment section of the social media data online retrieval method based on dynamic memory, and is not described herein again.

In a specific implementation, as shown in FIG. 3, the hash function learning module mainly includes two parts: a refined pseudo label matrix learning module and an online hash learning module. In the refined pseudo label matrix learning module, a refined pseudo label matrix is constructed from the pairwise similarity matrix between the labels in order to reduce the negative influence of the user labels. The refined pseudo label matrix better reveals the association between samples and labels and guides the learning of the hash codes. In the online hash learning module, a memory-based similarity learning strategy is used to learn the hash codes in order to solve the problem of catastrophic forgetting. Specifically, in each training round some of the most representative data points are selected from the old data to update the memory; the similarity between the data in the memory and the new data is then calculated to maintain the correlation between new and old data, and this similarity is embedded into the objective function, from which the hash code of each instance is obtained. In addition, the embodiment provides an efficient discrete online optimization algorithm whose time complexity is linear in the size of the new data, so that the model extends easily to large-scale datasets.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A social media data online retrieval method based on dynamic memory is characterized by comprising the following steps:

acquiring sample data of multiple rounds and the corresponding user tags;

2. The dynamic memory-based online retrieval method of social media data as claimed in claim 1, wherein the sample data comprises text data, image data and video data; after sample data of multiple rounds and corresponding user labels are obtained, before hash function learning is sequentially carried out on each round training sample, the method further comprises the following steps: and extracting the characteristics of the sample data, and carrying out one-hot coding on the user label to obtain a label representation.

3. The social media data online retrieval method based on dynamic memory of claim 2, wherein a label matrix is determined according to the label representations corresponding to the sample data of the t-th round and to the sample data before the t-th round; the label matrix is multiplied by its transpose to obtain the pairwise similarity matrix of the labels; and the pairwise similarity matrix is normalized to obtain the refined pseudo labels of the t-th round.

4. The method for online retrieval of social media data based on dynamic memory as claimed in claim 1, wherein the method for determining the hash loss function comprises:

5. The dynamic memory-based online social media data retrieval method as claimed in claim 4, wherein the distances between the refined pseudo tags and the sample points in the user tags are calculated, the obtained distances are sorted in the order from small to large, the sample points in the preset number arranged at the front are selected as the representative points, in the hash learning process, the memory stores the representative points in the preset number fixedly, and the representative points in the preset part of the memory are replaced by the representative points selected newly in each round.

6. The method of claim 4, wherein the sample data is processed using a radial basis function to capture non-linear features of the sample data.

7. The social media data online retrieval method based on dynamic memory as claimed in claim 1, wherein the hash loss function is minimized by iterative optimization, specifically: in each iteration only the designated target variable is optimized, and all other variables in the hash loss function are kept unchanged; the partial derivative of the hash loss function with respect to the target variable is set to zero and solved to obtain the optimized target variable.

8. The online social media data searching method based on dynamic memory as claimed in claim 1, wherein the obtaining of the search result by comparing the hash code of the social media data with the hash code of the sample data in the database comprises: and calculating the Hamming distance between the hash code of the social media data and the hash code of the sample data in the database, and returning the sample data with preset quantity according to the Hamming distance.

9. A social media data online retrieval system based on dynamic memory, comprising:

10. The social media data online retrieval system based on dynamic memory of claim 9, further comprising:

and the preprocessing module is used for extracting the characteristics of the sample data and carrying out one-hot coding on the user label to obtain a label representation.