CN111241106B - Approximation data processing method, device, medium and electronic equipment - Google Patents

Approximation data processing method, device, medium and electronic equipment

Info

Publication number
CN111241106B
CN111241106B (application CN202010044200.2A)
Authority
CN
China
Prior art keywords
data
processed
coverage
vector
sensitive hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010044200.2A
Other languages
Chinese (zh)
Other versions
CN111241106A (en
Inventor
冯晨
王健宗
彭俊清
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010044200.2A priority Critical patent/CN111241106B/en
Priority to PCT/CN2020/093165 priority patent/WO2021143016A1/en
Publication of CN111241106A publication Critical patent/CN111241106A/en
Application granted granted Critical
Publication of CN111241106B publication Critical patent/CN111241106B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to the field of data processing, and discloses an approximate data processing method, an approximate data processing device, a medium and electronic equipment. The method comprises the following steps: acquiring data to be processed; acquiring a vector corresponding to the data to be processed; performing a hash operation on the vector of the data to be processed with each position-sensitive hash function in a position-sensitive hash function family to obtain a mapping value corresponding to the vector of the data to be processed; repeating the step of constructing a coverage group a first predetermined number of times to obtain a plurality of coverage groups, the step of constructing a coverage group comprising constructing a coverage group based on the mapping values corresponding to the vectors of the data to be processed and the position-sensitive hash function that hashed those vectors; and integrating the plurality of coverage groups to obtain the final coverage of the data to be processed, wherein data to be processed belonging to the same final coverage is approximate data. The method avoids unstable time consumption when processing large amounts of approximate data, and improves data processing efficiency as a whole.

Description

Approximation data processing method, device, medium and electronic equipment
Technical Field
The disclosure relates to the technical field of data processing, and in particular relates to an approximate data processing method, an approximate data processing device, a medium and electronic equipment.
Background
Currently, in order to quickly find data similar to a given item during data processing, a commonly used scheme is locality-sensitive hashing (LSH), which maps high-dimensional data to low-dimensional data and maps similar data into the same bucket: two data points that are adjacent in the original data space remain adjacent in the mapped data space with high probability, while two non-adjacent data points are adjacent in the mapped data space with low probability. However, using the LSH algorithm requires specifying several hyperparameters, including the random numbers in the hash functions, so the bucketing effect depends strongly on the random numbers given. When subsequent data processing tasks are performed on the mapping result of the LSH algorithm, processing a large amount of data places high demands on the quality of that mapping, which causes a certain instability. On the one hand, if the amount of data in a bucket is too large, the efficiency gain of the LSH algorithm is greatly reduced; on the other hand, for the same set of data, the time taken to perform the data processing tasks may be indeterminate, as it is affected by the amount of data in each bucket.
Disclosure of Invention
In order to solve the above technical problems in the data processing technical field, an object of the present disclosure is to provide an approximate data processing method, an apparatus, a medium and an electronic device.
According to an aspect of the present disclosure, there is provided an approximation data processing method including:
acquiring a plurality of data to be processed;
acquiring a vector corresponding to the data to be processed;
carrying out hash operation on the vector of the data to be processed by utilizing each position-sensitive hash function in a preset position-sensitive hash function family to obtain a mapping value corresponding to the vector of the data to be processed, wherein the preset position-sensitive hash function family comprises a plurality of position-sensitive hash functions;
repeating the step of constructing an overlay group a first predetermined number of times to obtain a plurality of overlay groups, wherein the step of constructing an overlay group comprises constructing an overlay group based on the mapping value corresponding to the vector of the data to be processed and a position sensitive hash function for performing hash operation on the vector of the data to be processed, the overlay group comprising at least one overlay, each overlay comprising at least one data to be processed;
and integrating the plurality of coverage groups to obtain the final coverage of each piece of data to be processed, wherein the data to be processed belonging to the same final coverage is approximate data.
According to another aspect of the present disclosure, there is provided an approximation data processing apparatus, the apparatus comprising:
the first acquisition module is configured to acquire a plurality of data to be processed;
a second acquisition module configured to acquire a vector corresponding to the data to be processed;
the hash module is configured to perform hash operation on the vector of the data to be processed by utilizing each position-sensitive hash function in a preset position-sensitive hash function family to obtain a mapping value corresponding to the vector of the data to be processed, wherein the preset position-sensitive hash function family comprises a plurality of position-sensitive hash functions;
a repeated execution module configured to repeat the step of constructing an overlay group a first predetermined number of times to obtain a plurality of overlay groups, the step of constructing an overlay group comprising constructing an overlay group based on the mapping values corresponding to the vectors of the data to be processed and a position-sensitive hash function that hashes the vectors of the data to be processed, the overlay group comprising at least one overlay, each overlay comprising at least one of the data to be processed;
and the integration module is configured to integrate the plurality of coverage groups to obtain a final coverage of each data to be processed, wherein the data to be processed belonging to the same final coverage is approximate data.
According to another aspect of the present disclosure, there is provided a computer readable program medium storing computer program instructions which, when executed by a computer, cause the computer to perform the method as described above.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor;
a memory having stored thereon computer readable instructions which, when executed by the processor, implement a method as described above.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
the approximate data processing method provided by the disclosure comprises the following steps: acquiring a plurality of data to be processed; acquiring a vector corresponding to the data to be processed; carrying out hash operation on the vector of the data to be processed by utilizing each position-sensitive hash function in a preset position-sensitive hash function family to obtain a mapping value corresponding to the vector of the data to be processed, wherein the preset position-sensitive hash function family comprises a plurality of position-sensitive hash functions; repeating the step of constructing an overlay group a first predetermined number of times to obtain a plurality of overlay groups, wherein the step of constructing an overlay group comprises constructing an overlay group based on the mapping value corresponding to the vector of the data to be processed and a position sensitive hash function for performing hash operation on the vector of the data to be processed, the overlay group comprising at least one overlay, each overlay comprising at least one data to be processed; and integrating the plurality of coverage groups to obtain the final coverage of each piece of data to be processed, wherein the data to be processed belonging to the same final coverage is approximate data.
According to the method, the coverage groups are constructed for multiple times, and the coverage groups are integrated, so that the time consumption of approximate data processing is stabilized in a smaller range while the accuracy of the data processing is maintained, the conditions that the time consumption of processing a large amount of approximate data is unstable and the time consumption is possibly excessive are avoided, and the data processing efficiency is improved as a whole.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a system architecture diagram illustrating an approximation data processing method, according to an example embodiment;
FIG. 2 is a flowchart illustrating a method of approximation data processing, according to an example embodiment;
FIG. 3 is a detailed flow diagram of step 220 according to an embodiment illustrated in the corresponding embodiment of FIG. 2;
FIG. 4 is a schematic diagram of a coverage group shown in accordance with an exemplary embodiment;
FIG. 5 is a flowchart illustrating steps for constructing an overlay group when the data to be processed is voiceprint data according to one embodiment illustrated in the corresponding embodiment of FIG. 2;
FIG. 6 is a block diagram of an approximation data processing apparatus, shown in accordance with an illustrative embodiment;
FIG. 7 is an exemplary block diagram of an electronic device implementing the above-described approximation data processing method, according to an exemplary embodiment;
fig. 8 is a computer readable storage medium embodying the above-described approximate data processing method according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
The present disclosure first provides an approximate data processing method. The data here may be any data that can be converted into vectors, such as audio, text, or image data. The approximate data processing method groups data items that are likely to be similar; after the possibly approximate data are grouped together, the grouping can be used to perform tasks such as retrieval and further accurate classification.
The implementation terminal of the present disclosure may be any device having an operation and processing function, where the device may be connected to an external device, and used for receiving or sending data, and specifically may be a portable mobile device, such as a smart phone, a tablet computer, a notebook computer, PDA (Personal Digital Assistant), or a fixed device, such as a computer device, a field terminal, a desktop computer, a server, a workstation, or the like, or may be a collection of multiple devices, such as a physical infrastructure of cloud computing or a server cluster.
Preferably, the implementation terminal of the present disclosure may be a server or a physical infrastructure of cloud computing.
FIG. 1 is a system architecture diagram illustrating an approximation data processing method according to an example embodiment. As shown in fig. 1, the system architecture includes a server 110 and a user terminal 120, where the server 110 is connected to the user terminal 120 through a communication link, and can receive data sent by the user terminal 120 and send the data to the user terminal 120, and in this embodiment, the server 110 is an implementation terminal of the disclosure. After a user uses the user terminal 120 to send a plurality of data to the server 110, the server 110 may classify the received data by executing the approximate data processing method provided in the present disclosure, so that the data that may be similar is classified into one type, thereby providing support for data classification results for performing other tasks such as searching, accurate classification, and the like.
It should be noted that fig. 1 is only one embodiment of the present disclosure. Although the implementation terminal in the present embodiment is a server, in other embodiments, the implementation terminal may be various terminals or devices as described above; although in the present embodiment, the data for performing data processing is sent from only one terminal, in other embodiments or specific applications, the data for performing data processing may be obtained from multiple terminals, for example, a Server and a user terminal are in a C/S (Client/Server) architecture or a B/S (Browser/Server) architecture, and multiple user terminals use clients or browsers installed thereon to send data to the Server, and the data sources on the user terminals may also be various. The present disclosure is not limited thereto, nor should the scope of the present disclosure be limited thereby.
FIG. 2 is a flowchart illustrating a method of approximation data processing, according to an example embodiment. The approximate data processing method of the present embodiment may be executed by a server, as shown in fig. 2, including the steps of:
step 210, obtaining a plurality of data to be processed.
As described above, the data to be processed may be various types of data, such as image data, voice data, text data, and the like.
Step 220, obtaining a vector corresponding to the data to be processed.
In one embodiment, the data to be processed is image data, and the image data is converted into vectors according to pixel values of pixel points included in each image data.
In one embodiment, each of the data to be processed corresponds to a vector, and the data to be processed is voiceprint data, and the specific steps of step 220 may be as shown in fig. 3. FIG. 3 is a detailed flow chart of step 220 according to one embodiment shown in the corresponding embodiment of FIG. 2, including the steps of:
step 221, obtaining the mel-frequency cepstrum coefficient characteristic value of the data to be processed.
In one embodiment, the Mel-frequency cepstral coefficient (MFCC, Mel-Frequency Cepstral Coefficients) feature values are speech feature values obtained by performing a series of processes on the speech data, such as pre-emphasis, framing, windowing, Fourier transform, and inverse Fourier transform.
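As a rough illustration of the first stages of this pipeline, the sketch below implements only pre-emphasis, framing, and Hamming windowing in plain Python; the full MFCC computation additionally applies a mel filterbank, a logarithm, and a discrete cosine transform. The function names and parameter values (25 ms frames with a 10 ms hop at 16 kHz) are illustrative, not taken from the patent.

```python
import math

def preemphasize(signal, alpha=0.97):
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1].
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_and_window(signal, frame_len=400, hop=160):
    # Split the signal into overlapping frames and apply a Hamming window.
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)))
                       for n, s in enumerate(frame)])
    return frames
```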
Step 222, inputting the characteristic value of the mel-frequency cepstrum coefficient of each piece of data to be processed into a pre-trained gaussian mixture-universal background model combined with a joint factor analysis model, and obtaining an identity confirmation vector corresponding to each piece of data to be processed.
The joint factor analysis (Joint Factor Analysis, JFA) model models the channel differences and the differences between different speaker data, removes interference components of the channel, and realizes more accurate extraction of voiceprint features in voiceprint data.
The Gaussian mixture-generic background model (Gaussian mixture model-Universal Background Model, GMM-UBM) is a model that can recognize similar speech data, and training of the GMM-UBM model refers to the process of determining parameters of the model.
In one embodiment, training of the GMM-UBM model is achieved by inputting a plurality of voiceprint data pre-labeled with corresponding speakers into the GMM-UBM model.
The identity confirmation vector, or I-Vector (Identity Vector), is a vector in which the speaker-specific voiceprint feature information of the speech is recorded.
Step 230, performing hash operation on the vector of the data to be processed by using each position sensitive hash function in the preset position sensitive hash function family to obtain a mapping value corresponding to the vector of the data to be processed.
Wherein the predetermined family of location sensitive hash functions includes a plurality of location sensitive hash functions.
The location-sensitive hash (locality-sensitive hashing, LSH) function is a function that reduces high-dimensional data to a low-dimensional space. Under an LSH function, data points that are adjacent in the original data space are mapped to adjacent points with high probability, while data points that are non-adjacent in the original data space are mapped to non-adjacent points with high probability.
In one embodiment, each location-sensitive hash function in the predetermined family of location-sensitive hash functions is established by the following formula:
h(x) = ⌊(a·x + b) / r⌋,
wherein a is a random number sequence, b is a random number in (0, r), r is the difference between the maximum value and the minimum value among the features of the identity confirmation vectors of the data to be processed, and x is the identity confirmation vector of the data to be processed. A predetermined family comprising a plurality of location-sensitive hash functions is established by adjusting the two parameters a and b.
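A hash family of this form, the p-stable scheme h(x) = ⌊(a·x + b) / r⌋, can be sketched in Python as follows. This is our illustration, not the patent's implementation: a is drawn as a Gaussian vector, and all function and parameter names are hypothetical.

```python
import math
import random

def make_pstable_hash(dim, r, seed=None):
    # One member of a p-stable LSH family: h(x) = floor((a . x + b) / r),
    # where a is a random Gaussian vector and b is uniform in (0, r).
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, r)
    def h(x):
        projection = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((projection + b) / r)
    return h

# A family is obtained by varying a and b (here via the seed).
family = [make_pstable_hash(4, r=2.0, seed=s) for s in range(10)]
```

Vectors whose projections a·x fall into the same width-r slot receive the same mapping value, so nearby vectors collide under most functions of the family.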
In one embodiment, each location-sensitive hash function in the predetermined family of location-sensitive hash functions is established by the following formula:
h(v) = sgn(v·r),
wherein r is the normal vector of a random hyperplane, v is the identity confirmation vector of the data to be processed, and sgn is the sign function.
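A minimal sketch of this random-hyperplane variant, representing the hyperplane by its normal vector; the helper name is ours, not the patent's:

```python
import random

def make_sign_hash(dim, seed=None):
    # Random-hyperplane LSH: h(v) = sgn(v . r), where r is the normal
    # vector of a random hyperplane; vectors on the same side collide.
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    def h(v):
        return 1 if sum(vi * ri for vi, ri in zip(v, r)) >= 0 else -1
    return h
```

Two vectors collide exactly when they lie on the same side of the hyperplane, so the collision probability grows as the angle between them shrinks.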
In one embodiment, each location-sensitive hash function in the predetermined family of location-sensitive hash functions is established by the following formula:
h(v) = E8((v8 + b) / w),
wherein E8 is the decoding function of the E8 lattice, v8 is 8-dimensional data randomly taken from the vector v, v is the identity confirmation vector of the data to be processed, b is an 8-dimensional random offset vector, and w is a normalization factor.
Step 240, repeating the step of constructing the coverage group for a first predetermined number of times, so as to obtain a plurality of coverage groups, wherein the step of constructing the coverage group includes constructing the coverage group based on the mapping value corresponding to the vector of the data to be processed and a position sensitive hash function for performing hash operation on the vector of the data to be processed.
The overlay group comprises at least one overlay, each overlay comprising at least one of the data to be processed.
The overlay (canopy) is essentially a collection of data to be processed; each data to be processed may belong to at least one overlay, and data to be processed divided into the same overlay is considered to be approximate data. The coverage group is a group of overlays, i.e. a collection of overlays, and may include at least one overlay. If each data to be processed is uniquely identified by an index, then an overlay may be a set of indexes, since each index uniquely corresponds to one data to be processed. For example, if the indexes of all data to be processed are {1,2,3,4}, one coverage group can be obtained as {[1,2], [3], [4]}, where [1,2], [3] and [4] are each one overlay, and another coverage group can be obtained as {[1,2,3], [3,4], [4]}, where [1,2,3], [3,4] and [4] are each one overlay.
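The index-set view of overlays can be sketched directly; `coverages_of` is a hypothetical helper (not part of the patent) that returns every overlay an index belongs to, illustrating that a datum may belong to more than one overlay:

```python
def coverages_of(coverage_group, index):
    # Return every overlay (set of indexes) in the group that contains
    # `index`; a data item may belong to more than one overlay.
    return [cov for cov in coverage_group if index in cov]

group_a = [{1, 2}, {3}, {4}]            # the group {[1,2], [3], [4]}
group_b = [{1, 2, 3}, {3, 4}, {4}]      # the group {[1,2,3], [3,4], [4]}
```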
The first predetermined number may be any number greater than 2 set based on human experience, for example, the first predetermined number may be 10. Since the coverage groups are constructed a plurality of times, each of which may or may not be identical to other established coverage groups, there may be cases where two or more coverage groups are identical, and if the same plurality of coverage groups are regarded as one, the number of coverage groups obtained by repeating the step of constructing the coverage groups a first predetermined number of times is less than or equal to the first predetermined number.
Fig. 4 is a schematic diagram of one coverage group shown in accordance with an exemplary embodiment. As shown in fig. 4, the overlay group includes a first overlay 410, a second overlay 420, and a third overlay 430, where black dots in each overlay represent data to be processed belonging to the overlay, and it can be seen that each overlay includes at least one data to be processed, for example, the first data to be processed 440 is the data to be processed belonging to the second overlay 420. The first overlay 410 and the second overlay 420 have intersecting portions that represent that the second data to be processed 450 belonging to the portions belongs to more than one overlay, i.e. to both the first overlay 410 and the second overlay 420.
Fig. 5 is a flowchart of the steps of constructing an overlay group when the data to be processed is voiceprint data according to one embodiment illustrated in the corresponding embodiment of fig. 2. Referring to fig. 5, the method comprises the following steps:
step 510, constructing an integer set comprising 1, the number of dimensions of the identity confirmation vector, and all integers therebetween.
The identity verification Vector I-Vector is a Vector extracted from voiceprint data, and may be the same as the identity verification Vector in the previous embodiment.
For example, if the number of dimensions of the identity confirmation vector is 8, then the final set of integers is {1,2,3,4,5,6,7,8}.
Step 520, an initial coverage group is established and a counter is set to 1.
Wherein the initial coverage group is an empty set.
The overlay group may be recorded in various data structures, such as an array, and when the initial overlay group is recorded using an array, the array corresponding to the initial overlay group is a null array.
The counter is a module or component embedded in the implementation terminal of the present disclosure that has a counting function.
Step 530, determining whether the integer set is an empty set.
The integer set is an empty set, that is, the integer set does not contain any element.
In case the integer set is an empty set, the step of constructing the overlay group is directly ended.
In the case where the integer set is not an empty set, step 540 and subsequent steps are repeated until the construction of the overlay group is completed when the integer set is an empty set.
Step 540, randomly fetching an element from the integer set as a target element.
The structured set of integers includes a plurality of integers, each integer being an element.
Step 550, for each location-sensitive hash function, obtaining the indexes of all identity confirmation vectors whose output under that location-sensitive hash function is equal to the output obtained when the identity confirmation vector indexed by the target element is input to the same location-sensitive hash function.
The index of the identity confirmation vector is the unique identification of the identity confirmation vector, and is an integer in the integer set, and each identity confirmation vector is uniquely corresponding to one index.
In this step, the fetched target element is fixed, and thus the identity confirmation vector indexed to the target element is also fixed, the step being performed based on the fixed identity confirmation vector.
And inputting the identity confirmation vector with the index of the target element into each position sensitive hash function to obtain an output result corresponding to the identity confirmation vector by each position sensitive hash function, and when the output result obtained by inputting other identity confirmation vectors into the position sensitive hash functions is the same as the output result obtained by inputting the identity confirmation vector with the index of the target element into the same position sensitive hash function, considering that the identity confirmation vectors are similar to the identity confirmation vector with the index of the target element.
Step 560, adding the intersection of the union of all the obtained identity confirmation vector indexes and the integer set as the coverage with the index of the value of the counter to the initial coverage group.
Taking the union of all the obtained identity confirmation vector indexes keeps only one copy of each repeated index, thereby preventing the overlay under construction from including a plurality of identical identity confirmation vectors.
The index, the identity confirmation vector and the data to be processed are in one-to-one correspondence, so that the corresponding identity confirmation vector and the data to be processed can be effectively divided by establishing the coverage group in an index mode.
As overlays are built, elements in the integer set decrease, so the union of all the obtained identity confirmation vector indexes may contain elements that should not be included in the overlay currently being built; these need to be removed. Taking the intersection with the integer set removes the elements that are no longer in the integer set from the current overlay.
Step 570, determining a similarity score for each identity confirmation vector in the overlay indexed by the value of the counter, the similarity score being equal to the ratio of the number of location-sensitive hash functions under which the index of that identity confirmation vector was obtained to the total number of location-sensitive hash functions.
Since the indexes of the hash function are uniquely corresponding to the hash function, the number of hash functions is equivalent to the number of indexes of the hash function.
The overlay indexed by the value of the counter is the overlay to be constructed for the round.
As mentioned before, the union keeps only one copy of each repeated element in the overlay, but the number of repetitions still carries information: for a given identity confirmation vector, the more location-sensitive hash functions whose output on that vector equals their output on the identity confirmation vector indexed by the target element, the more similar the two vectors are. This ratio can therefore be used as a similarity score.
Step 580 removes from the integer set all indexes of identity confirmation vectors having similarity scores greater than or equal to a predetermined similarity threshold.
When the similarity score between an identity confirmation vector and the identity confirmation vector indexed by the target element is sufficiently large, the two vectors are sufficiently similar. Removing the indexes of such identity confirmation vectors from the integer set prevents an index that should exist in only one overlay from becoming an element of another overlay, thereby ensuring the accuracy of the overlay division.
Step 590, increment the counter by 1.
As described in step 560, the index of the constructed overlay equals the value of the counter, so the counter both counts the constructed overlays and assigns each its index. Each increment of the counter marks the end of the current overlay's construction within this round of coverage-group building; incrementing the counter by 1 makes its new value available as the index of the next overlay to be constructed.
It is easy to understand that the index is in one-to-one correspondence with both the data to be processed and the identity confirmation vector, so a coverage group may equivalently contain the data to be processed, the identity confirmation vectors, or the indexes; whichever of the three it contains, the purpose of classifying the data to be processed is achieved.
If two identity confirmation vectors are mapped to the same result by the same location-sensitive hash function, the two vectors can be considered similar with high probability. In this embodiment, the overlays are constructed using exactly this mechanism, which ensures that the elements within each constructed overlay are similar to one another; in addition, removing indexes of identity confirmation vectors from the integer set based on similarity ensures the accuracy of the overlay division.
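Steps 540 through 590 can be summarized in the following sketch, under the assumption that the hash functions are plain callables on vectors and using 0-based indexes for simplicity (the text numbers them from 1); all names such as `build_overlay_group` and `sim_threshold` are illustrative rather than from the original:

```python
import random

def build_overlay_group(vectors, hash_family, sim_threshold=0.5, seed=None):
    """One round of overlay-group construction (cf. steps 540-590);
    indexes are 0-based here for simplicity."""
    rng = random.Random(seed)
    # Precompute every hash value once: table[h][i] = hash_family[h](vectors[i]).
    table = [[h(v) for v in vectors] for h in hash_family]
    remaining = set(range(len(vectors)))   # the "integer set" of indexes
    overlays = {}                          # counter value -> overlay (index set)
    counter = 1
    while remaining:
        target = rng.choice(sorted(remaining))          # random target element
        # Union, over all hash functions, of the indexes colliding with the
        # target, intersected with the remaining integer set.
        union = set()
        for row in table:
            union |= {i for i, val in enumerate(row) if val == row[target]}
        overlays[counter] = union & remaining
        # Similarity score: fraction of hash functions agreeing with the
        # target; sufficiently similar indexes leave the integer set.
        for i in list(overlays[counter]):
            score = sum(row[i] == row[target] for row in table) / len(table)
            if score >= sim_threshold:
                remaining.discard(i)
        counter += 1                       # index for the next overlay
    return overlays
```

The loop always terminates because the target element scores 1.0 against itself and is therefore removed from the integer set in every iteration.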
And step 250, integrating the plurality of coverage groups to obtain a final coverage to which each data to be processed belongs.
Wherein the data to be processed belonging to the same final coverage is approximate data.
Integrating multiple coverage groups means consolidating the distribution of the data to be processed across the coverage groups into a single coverage group. The resulting coverage group comprises multiple final coverages, and each piece of data to be processed belongs to at least one final coverage.
Within each coverage group, the data to be processed in each coverage can be considered approximate. The final coverage is the final classification result for similar data to be processed; as a whole, it reflects how the approximate data to be processed are distributed across the individual coverage groups.
In one embodiment, the integrating the plurality of coverage groups to obtain a final coverage to which each of the data to be processed belongs includes:
if the data to be processed belongs to the same coverage in the coverage group exceeding a second preset number, classifying the data to be processed into one coverage, wherein the second preset number is smaller than the first preset number;
data to be processed that do not belong to the same overlay in more than a second predetermined number of overlay groups are removed from their respective overlays, and all the data to be processed removed in this way are classified as one overlay.
In one embodiment, if the data to be processed belongs to the same overlay in more than a second predetermined number of overlay groups, classifying the data to be processed as one overlay includes:
for each index combination containing at least two indexes within each overlay, determining the number of overlay groups in which that index combination appears within a single overlay;
and classifying the data to be processed whose indexes belong to the index combination into one coverage when this number is larger than the second predetermined number.
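A minimal sketch of this integration step, assuming each coverage group is given as a list of index sets, counting how often each index pair co-occurs in an overlay, and merging pairs seen in more than the second predetermined number of groups with a small union-find; indexes never merged are pooled into one leftover overlay, approximating the removal step described above (all names are illustrative):

```python
from itertools import combinations

def integrate_overlay_groups(overlay_groups, second_threshold):
    """Merge indexes that share an overlay in more than `second_threshold`
    of the overlay groups; pool the rest into one leftover overlay."""
    # Count, for each index pair, in how many groups it shares an overlay.
    pair_counts = {}
    all_indexes = set()
    for group in overlay_groups:          # each group: a list of index sets
        for overlay in group:
            all_indexes |= overlay
            for a, b in combinations(sorted(overlay), 2):
                pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1

    parent = {i: i for i in all_indexes}  # union-find over indexes
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), n in pair_counts.items():
        if n > second_threshold:          # co-occurs in enough groups
            parent[find(a)] = find(b)

    finals = {}
    for i in all_indexes:
        finals.setdefault(find(i), set()).add(i)
    merged = [s for s in finals.values() if len(s) > 1]
    singles = [s for s in finals.values() if len(s) == 1]
    if singles:                           # pool unmerged indexes together
        merged.append(set().union(*singles))
    return merged
```

For instance, with three coverage groups in which indexes 1 and 2 share an overlay three times but 3 and 4 only twice, a threshold of 2 yields the final coverages {1, 2} and the leftover pool {3, 4}.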
After classifying the data to be processed into final overlays, the approximate data to be processed in each final overlay can be applied to perform other data processing tasks.
In one embodiment, after integrating the plurality of coverage groups to obtain a final coverage to which each of the data to be processed belongs, the method further includes:
clustering the data to be processed based on the obtained final coverage containing the data to be processed to divide the data to be processed into a plurality of classes.
This embodiment mainly performs a further clustering task on the data to be processed. Because most of the data to be processed within each established final coverage are similar and thus already largely classified, clustering on this basis greatly shortens the clustering time and improves clustering efficiency. In addition, because the final coverage is obtained by integrating multiple coverage groups, the time consumed by the overall classification task is more stable.
In one embodiment, the data to be processed in each final overlay is clustered using a k-means algorithm.
In one embodiment, the clustering the data to be processed based on the obtained final coverage including the data to be processed to divide the data to be processed into a plurality of classes includes:
taking each data to be processed as a class, and determining initial class spacing between the classes based on final coverage of each data to be processed;
and repeatedly executing a classifying process until the class interval between two classes with the minimum class interval reaches a preset class interval threshold value or all data to be processed are combined into one class, wherein the classifying process comprises the following steps:
combining two classes with minimum class spacing;
class spacing between classes is updated.
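The classifying process above can be sketched as a generic agglomerative loop; here `dist` stands in for the class-spacing function and `threshold` for the predetermined class-spacing threshold (illustrative names and an assumed interface, not the original's exact procedure):

```python
def agglomerate(n_items, dist, threshold):
    """Repeatedly merge the two classes with the smallest class spacing
    until that spacing reaches `threshold` or one class remains.
    `dist` maps two frozensets of item indexes to a spacing value."""
    classes = [frozenset([i]) for i in range(n_items)]
    while len(classes) > 1:
        # Find the pair of classes with the minimum class spacing.
        a, b = min(
            ((x, y) for i, x in enumerate(classes) for y in classes[i + 1:]),
            key=lambda p: dist(p[0], p[1]),
        )
        if dist(a, b) >= threshold:        # spacing threshold reached
            break
        classes = [c for c in classes if c not in (a, b)] + [a | b]
    return classes
```

With four points 0, 1, 10, 11 on a line, average pairwise distance as the spacing, and a threshold of 5, the loop merges {0, 1} and {10, 11} and then stops.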
In this embodiment, clustering is performed iteratively according to the class spacing, which ensures the reliability of the clustering.
In one embodiment, the merging the two classes with the smallest class spacing includes:
obtaining the class spacing between an arbitrary pair of classes containing data to be processed belonging to the first final coverage, taking this as the class spacing of the initial class pair, and marking it as the minimum class spacing;
judging: starting from the class spacing between any pair of classes that contains data to be processed belonging to the final coverage with the smallest index and has not been marked as judged, judging for each such pair of classes whether its class spacing is smaller than the current minimum class spacing, and marking each class in the pair as judged;
canceling and marking: if so, canceling the previous minimum class spacing mark and marking the class spacing of this pair of classes as the minimum class spacing;
repeating the judging step and the canceling-and-marking step until the pair of classes marked as having the minimum class spacing no longer changes;
and merging the pair of classes marked as having the minimum class spacing as the two classes with the smallest class spacing.
In one embodiment, the data to be processed is voiceprint data, each data to be processed is used as a class, and the determining the initial class interval between the classes based on the final coverage to which each data to be processed belongs includes:
obtaining a similarity score of each piece of data to be processed according to a probability-based linear discriminant analysis model;
normalizing each similarity score to be between [0,1] to obtain a normalized similarity score;
for any pair of data to be processed: if the corresponding classes belong to the same coverage, setting the class spacing between the classes corresponding to the pair to the difference between 1 and the normalized similarity score corresponding to the pair; if the corresponding classes belong to different coverages, setting the class spacing between the classes corresponding to the pair to 1;
The updating the class spacing between classes includes:
for each pair of classes after merging, obtaining the sum of the normalized similarity scores between all data to be processed belonging to the first class and all data to be processed belonging to the second class;
and taking the ratio of this sum to the number of data pairs formed between all data to be processed in the first class and all data to be processed in the second class as the class spacing between the two classes in the pair.
This embodiment achieves clustering of voiceprint data. In many scenarios the same user may generate multiple pieces of voiceprint data, and when a large number of users provide a large volume of voiceprint data, the intermixed voiceprint data need to be classified by user; this embodiment allows voiceprint data to be classified efficiently, accurately, and stably.
The first class and the second class refer to the two different classes in a pair of classes. For example, if the first class contains the three data to be processed A, B, and C, and the second class contains the two data to be processed D and E, the two classes correspond to six combinations of data to be processed: AD, AE, BD, BE, CD, and CE. Each combination has a normalized similarity score; the sum obtained is the sum of the scores of the six combinations, and the ratio of this sum to 6 is the class spacing between the two classes.
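The update rule in this example can be written directly as follows, with `norm_score` an assumed symmetric pairwise scoring function already normalized to [0, 1] (an illustrative interface, not from the original):

```python
def class_spacing(first, second, norm_score):
    """Updated class spacing: the sum of normalized similarity scores over
    all cross-class data pairs, divided by the number of such pairs
    (e.g. {A, B, C} vs {D, E} gives 6 pairs)."""
    pairs = [(x, y) for x in first for y in second]
    return sum(norm_score(x, y) for x, y in pairs) / len(pairs)
```

With the A, B, C versus D, E example, the function sums the six pair scores and divides by 6.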
In one embodiment, the obtaining the similarity score of each piece of data to be processed according to the probability-based linear discriminant analysis model includes:
obtaining a vector of each piece of data to be processed representing speaker information by using a probability-based linear discriminant analysis model;
and, for each pair of data to be processed, obtaining the log-likelihood ratio between the vectors representing speaker information that correspond to the two pieces of data in the pair, as the similarity score of that pair of data to be processed.
The probability-based linear discriminant analysis (Probabilistic Linear Discriminant Analysis, PLDA) model is a channel compensation algorithm with which the effect of channel noise on the recorded speaker's voice information can be ignored.
In this embodiment, a probability-based linear discriminant analysis model is used to obtain, for each piece of data to be processed, a vector representing speaker information, and the log-likelihood ratio between the speaker-information vectors of two pieces of data to be processed is used as the score between them. The class spacing is obtained from the calculated scores and clustering is performed according to the class spacing, which improves dynamic fault tolerance during clustering and avoids misclassification caused by errors introduced by the coverage-integration method.
In one embodiment, normalizing each similarity score to between [0,1] results in a normalized similarity score, comprising:
Each similarity score is normalized to [0, 1] using the following formula to obtain a normalized similarity score:

X_norm = (X - X_min) / (X_max - X_min)

wherein X is the similarity score to be normalized, X_max is the maximum among the similarity scores of all pairs of data to be processed, X_min is the minimum among the similarity scores of all pairs of data to be processed, and X_norm is the normalized similarity score.
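A minimal sketch of this min-max normalization; the degenerate case where all scores are equal is not addressed in the text, so mapping it to 0 here is an assumption:

```python
def min_max_normalize(scores):
    """Normalize similarity scores to [0, 1] by min-max scaling."""
    lo, hi = min(scores), max(scores)
    if hi == lo:                 # all scores equal (assumption: map to 0)
        return [0.0 for _ in scores]
    return [(x - lo) / (hi - lo) for x in scores]
```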
In one embodiment, after clustering the data to be processed to divide the data to be processed into a plurality of classes based on the obtained final coverage including the data to be processed, the method further comprises:
when a request for acquiring data similar to the target data is received, the data similar to the target data is acquired based on the class into which each data to be processed is classified.
In this embodiment, since the data to be processed within each class obtained after coverage integration and clustering are typically highly similar, data similar to the target data can be retrieved directly from the similar class rather than by searching all the data, which improves retrieval efficiency.
In summary, according to the approximate data processing method provided in the embodiment of fig. 2, constructing coverage groups multiple times and integrating them stabilizes the time consumption of approximate data processing within a small range while maintaining processing accuracy. This avoids the unstable and potentially excessive time consumption of processing large amounts of approximate data, thereby improving data processing efficiency as a whole.
The present disclosure also provides an approximation data processing apparatus, the following are apparatus embodiments of the present disclosure.
Fig. 6 is a block diagram of an approximation data processing apparatus, shown in accordance with an exemplary embodiment. As shown in fig. 6, the apparatus 600 includes:
a first acquiring module 610 configured to acquire a plurality of data to be processed;
a second acquisition module 620 configured to acquire a vector corresponding to the data to be processed;
a hash module 630 configured to perform a hash operation on the vector of the data to be processed by using each location-sensitive hash function in a preset location-sensitive hash function family, to obtain a mapping value corresponding to the vector of the data to be processed, where the preset location-sensitive hash function family includes a plurality of location-sensitive hash functions;
a repeated execution module 640 configured to repeatedly execute, a first predetermined number of times, a step of constructing a coverage group based on the mapping values corresponding to the vectors of the data to be processed and the location-sensitive hash functions that hash those vectors, resulting in a plurality of coverage groups, each coverage group comprising at least one coverage and each coverage comprising at least one piece of the data to be processed;
the integration module 650 is configured to integrate the plurality of coverage groups to obtain a final coverage to which each of the data to be processed belongs, where the data to be processed belonging to the same final coverage is approximate data.
According to a third aspect of the present disclosure, there is also provided an electronic device capable of implementing the above method.
Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
An electronic device 700 according to this embodiment of the invention is described below with reference to fig. 7. The electronic device 700 shown in fig. 7 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 7, the electronic device 700 is embodied in the form of a general purpose computing device. Components of electronic device 700 may include, but are not limited to: the at least one processing unit 710, the at least one memory unit 720, and a bus 730 connecting the different system components, including the memory unit 720 and the processing unit 710.
Wherein the storage unit stores program code that is executable by the processing unit 710 such that the processing unit 710 performs steps according to various exemplary embodiments of the present invention described in the above-described "example methods" section of the present specification.
The memory unit 720 may include readable media in the form of volatile memory units, such as Random Access Memory (RAM) 721 and/or cache memory 722, and may further include Read Only Memory (ROM) 723.
The storage unit 720 may also include a program/utility 724 having a set (at least one) of program modules 725, such program modules 725 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 730 may be a bus representing one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 700 may also communicate with one or more external devices 900 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 700, and/or any device (e.g., router, modem, etc.) that enables the electronic device 700 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 750. Also, electronic device 700 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 760. As shown, network adapter 760 communicates with other modules of electronic device 700 over bus 730. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 700, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
According to a fourth aspect of the present disclosure, there is also provided a computer readable storage medium having stored thereon a program product capable of implementing the method described herein above. In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the "exemplary methods" section of this specification, when said program product is run on the terminal device.
Referring to fig. 8, a program product 800 for implementing the above-described method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiment of the present invention, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (8)

1. A method of approximation data processing, the method comprising:
acquiring a plurality of pieces of data to be processed, wherein the data to be processed are voiceprint data, and each piece of data to be processed corresponds to a vector;
the vector corresponding to the data to be processed is obtained by: acquiring a Mel-frequency cepstral coefficient characteristic value of the data to be processed; and inputting the Mel-frequency cepstral coefficient characteristic value of each piece of data to be processed into a pre-trained Gaussian mixture model-universal background model combined with a joint factor analysis model to obtain an identity confirmation vector corresponding to each piece of data to be processed;
Carrying out hash operation on the vector of the data to be processed by utilizing each position-sensitive hash function in a preset position-sensitive hash function family to obtain a mapping value corresponding to the vector of the data to be processed, wherein the preset position-sensitive hash function family comprises a plurality of position-sensitive hash functions;
repeating the step of constructing a coverage group a first predetermined number of times, resulting in a plurality of coverage groups, said coverage groups comprising at least one coverage, each said coverage comprising at least one of said data to be processed, said step of constructing a coverage group comprising: constructing an integer set comprising 1, the number of dimensions of the identity confirmation vector, and all integers therebetween; establishing an initial coverage group and setting a counter to be 1, wherein the initial coverage group is an empty set; repeatedly performing a build overlay process until the integer set is an empty set, the build overlay process comprising: randomly taking out an element from the integer set as a target element; for each position sensitive hash function, acquiring an index of an identity confirmation vector of an output result obtained by utilizing the position sensitive hash function, wherein the output result is equal to the output result obtained by inputting the identity confirmation vector of the target element by the position sensitive hash function; adding the intersection of the union of all the obtained identity confirmation vector indexes and the integer set as the coverage with the index of the value of the counter to the initial coverage group; determining a similarity score for each identity confirmation vector in the overlay whose index is the value of the counter, the similarity score being equal to a ratio of the number of indexes of the location-sensitive hash functions used to determine the index of the identity confirmation vector to the number of all location-sensitive hash functions; removing from the integer set all indexes of identity confirmation vectors having similarity scores greater than or equal to a predetermined similarity threshold; the counter is added with 1;
And integrating the plurality of coverage groups to obtain the final coverage of each piece of data to be processed, wherein the data to be processed belonging to the same final coverage is approximate data.
2. The method of claim 1, wherein each location-sensitive hash function in the family of preset location-sensitive hash functions is established by the following formula:

h(x) = floor((a · x + b) / r)

wherein a is a random number sequence, b is a random number in (0, r), r is the difference between the maximum value and the minimum value among the features of the identity confirmation vectors of the data to be processed, and x is the identity confirmation vector of a piece of data to be processed; establishing a preset family comprising a plurality of location-sensitive hash functions is realized by adjusting the two parameters a and b.
3. The method of claim 1, wherein integrating the plurality of coverage groups to obtain a final coverage to which each of the data to be processed belongs comprises:
if the data to be processed belongs to the same coverage in the coverage group exceeding a second preset number, classifying the data to be processed into one coverage, wherein the second preset number is smaller than the first preset number;
the data to be processed which does not belong to the same overlay in more than a second predetermined number of overlay groups are removed from the overlays respectively and all the data to be processed removed from the overlays are classified as one overlay.
4. The method of claim 1, wherein after integrating the plurality of coverage groups to obtain a final coverage to which each of the data to be processed belongs, the method further comprises:
clustering the data to be processed based on the obtained final coverage containing the data to be processed to divide the data to be processed into a plurality of classes.
5. The method of claim 4, wherein clustering the data to be processed based on the obtained final coverage containing the data to be processed to divide the data to be processed into a plurality of classes comprises:
taking each data to be processed as a class, and determining initial class spacing between the classes based on final coverage of each data to be processed;
and repeatedly executing a classifying process until the class interval between two classes with the minimum class interval reaches a preset class interval threshold value or all data to be processed are combined into one class, wherein the classifying process comprises the following steps:
combining two classes with minimum class spacing;
class spacing between classes is updated.
6. An approximation data processing apparatus, the apparatus comprising:
the first acquisition module is configured to acquire a plurality of pieces of data to be processed, wherein the data to be processed is voiceprint data, and each piece of data to be processed corresponds to a vector;
a second acquisition module configured to acquire a vector corresponding to the data to be processed by: acquiring a Mel-frequency cepstral coefficient characteristic value of the data to be processed; and inputting the Mel-frequency cepstral coefficient characteristic value of each piece of data to be processed into a pre-trained Gaussian mixture model-universal background model combined with a joint factor analysis model to obtain an identity confirmation vector corresponding to each piece of data to be processed;
the hash module is configured to perform hash operation on the vector of the data to be processed by utilizing each position-sensitive hash function in a preset position-sensitive hash function family to obtain a mapping value corresponding to the vector of the data to be processed, wherein the preset position-sensitive hash function family comprises a plurality of position-sensitive hash functions;
a repeating execution module configured to repeat the step of constructing a coverage group a first predetermined number of times, resulting in a plurality of coverage groups, the coverage groups including at least one coverage, each coverage including at least one of the data to be processed, the step of constructing a coverage group comprising: constructing an integer set comprising 1, the number of dimensions of the identity confirmation vector, and all integers therebetween; establishing an initial coverage group and setting a counter to be 1, wherein the initial coverage group is an empty set; repeatedly performing a build overlay process until the integer set is an empty set, the build overlay process comprising: randomly taking out an element from the integer set as a target element; for each position sensitive hash function, acquiring an index of an identity confirmation vector of an output result obtained by utilizing the position sensitive hash function, wherein the output result is equal to the output result obtained by inputting the identity confirmation vector of the target element by the position sensitive hash function; adding the intersection of the union of all the obtained identity confirmation vector indexes and the integer set as the coverage with the index of the value of the counter to the initial coverage group; determining a similarity score for each identity confirmation vector in the overlay whose index is the value of the counter, the similarity score being equal to a ratio of the number of indexes of the location-sensitive hash functions used to determine the index of the identity confirmation vector to the number of all location-sensitive hash functions; removing from the integer set all indexes of identity confirmation vectors having similarity scores greater than or equal to a predetermined similarity threshold; the counter is added with 1;
And the integration module is configured to integrate the plurality of coverage groups to obtain a final coverage of each data to be processed, wherein the data to be processed belonging to the same final coverage is approximate data.
7. A computer readable program medium, characterized in that it stores computer program instructions, which when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 5.
8. An electronic device, the electronic device comprising:
a processor;
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of any of claims 1 to 5.
CN202010044200.2A 2020-01-15 2020-01-15 Approximation data processing method, device, medium and electronic equipment Active CN111241106B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010044200.2A CN111241106B (en) 2020-01-15 2020-01-15 Approximation data processing method, device, medium and electronic equipment
PCT/CN2020/093165 WO2021143016A1 (en) 2020-01-15 2020-05-29 Approximate data processing method and apparatus, medium and electronic device


Publications (2)

Publication Number Publication Date
CN111241106A CN111241106A (en) 2020-06-05
CN111241106B true CN111241106B (en) 2023-08-29

Family

ID=70864985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010044200.2A Active CN111241106B (en) 2020-01-15 2020-01-15 Approximation data processing method, device, medium and electronic equipment

Country Status (2)

Country Link
CN (1) CN111241106B (en)
WO (1) WO2021143016A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393107B (en) * 2021-06-07 2022-08-12 东方电气集团科学技术研究院有限公司 Incremental calculation method for state parameter reference value of power generation equipment
CN115941327A (en) * 2022-12-08 2023-04-07 西安交通大学 Multilayer malicious URL identification method based on learning type bloom filter
CN116757558B (en) * 2023-08-16 2023-11-17 山东嘉隆新能源股份有限公司 Alcohol refining process quality prediction method and system based on data mining

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102915347A (en) * 2012-09-26 2013-02-06 中国信息安全测评中心 Distributed data stream clustering method and system
CN103744934A (en) * 2013-12-30 2014-04-23 南京大学 Distributed index method based on LSH (Locality Sensitive Hashing)
CN109730678A * 2019-01-28 2019-05-10 常州大学 A multilayer brain function network module division method
CN110083762A (en) * 2019-03-15 2019-08-02 平安科技(深圳)有限公司 Source of houses searching method, device, equipment and computer readable storage medium

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN104035949B * 2013-12-10 2017-05-10 南京信息工程大学 Similarity data retrieval method based on improved locality sensitive hashing (LSH) algorithm
US9626426B2 (en) * 2014-01-24 2017-04-18 Facebook, Inc. Clustering using locality-sensitive hashing with improved cost model
US9734147B2 (en) * 2014-09-29 2017-08-15 International Business Machines Corporation Clustering repetitive structure of asynchronous web application content
US9697245B1 (en) * 2015-12-30 2017-07-04 International Business Machines Corporation Data-dependent clustering of geospatial words
CN106228035B * 2016-07-07 2019-03-01 清华大学 Efficient clustering method based on locality-sensitive hashing and nonparametric Bayesian methods
CN109829066B * 2019-01-14 2023-03-21 南京邮电大学 Locality-sensitive hash image indexing method based on hierarchical structure


Also Published As

Publication number Publication date
WO2021143016A1 (en) 2021-07-22
CN111241106A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111241106B (en) Approximation data processing method, device, medium and electronic equipment
US9087049B2 (en) System and method for context translation of natural language
CN109783490B (en) Data fusion method and device, computer equipment and storage medium
CN110415679B (en) Voice error correction method, device, equipment and storage medium
CN112528025A (en) Text clustering method, device and equipment based on density and storage medium
US20160019671A1 (en) Identifying multimedia objects based on multimedia fingerprint
WO2021135455A1 (en) Semantic recall method, apparatus, computer device, and storage medium
US20210065694A1 (en) Voice Interaction Method, System, Terminal Device and Medium
WO2023138188A1 (en) Feature fusion model training method and apparatus, sample retrieval method and apparatus, and computer device
CN111382260A (en) Method, device and storage medium for correcting retrieved text
CN111382270A (en) Intention recognition method, device and equipment based on text classifier and storage medium
CN112988753B (en) Data searching method and device
US20180137098A1 (en) Methods and systems for providing universal portability in machine learning
CN110795541A (en) Text query method and device, electronic equipment and computer readable storage medium
CN111143556A (en) Software function point automatic counting method, device, medium and electronic equipment
CN111460117B (en) Method and device for generating intent corpus of conversation robot, medium and electronic equipment
CN111062431A (en) Image clustering method, image clustering device, electronic device, and storage medium
CN114443891A (en) Encoder generation method, fingerprint extraction method, medium, and electronic device
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence
CN110675865B (en) Method and apparatus for training hybrid language recognition models
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN114970470B (en) Method and device for processing file information, electronic equipment and computer readable medium
CN111460214B (en) Classification model training method, audio classification method, device, medium and equipment
CN111444319B (en) Text matching method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40030930

Country of ref document: HK

GR01 Patent grant