CN109034199B - Data processing method and device, storage medium and electronic equipment

Info

Publication number: CN109034199B
Application number: CN201810664630.7A
Authority: CN
Inventors: 刘岩
Original assignee: Taikang Insurance Group Co Ltd
Current assignee: Taikang Insurance Group Co Ltd
Priority date: 2018-06-25
Filing date: 2018-06-25
Publication date: 2022-02-01
Anticipated expiration: 2038-06-25
Also published as: CN109034199A

Abstract

The invention discloses a data processing method and device, a storage medium and electronic equipment, and relates to the technical field of computers. The data processing method comprises the following steps: acquiring the comprehensive confidence coefficient of the same field in each data source according to the data source confidence coefficient of each data source of the target object and the field confidence coefficient of the same field in each data source; and obtaining the fusion confidence coefficient of the same field in each data source according to the comprehensive confidence coefficient of the same field in each data source and the similarity of the same field among the data sources. The method and the device can obtain the fusion confidence of the fields in different data sources through the comprehensive confidence of the fields in multiple data sources and the similarity of the fields among different data sources, thereby realizing the reliability evaluation of the same field in different data sources.

Description

Data processing method and device, storage medium and electronic equipment

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data processing method, a data processing apparatus, a storage medium, and an electronic device.

Background

With the explosive growth of internet media, big data is becoming a rich resource to be mined for knowledge and wealth. Networking has broken the traditional reliance on a single information channel; diverse data sources and differing data structures have become basic characteristics of big data, and big data fusion has become a basic way of constructing event portraits or customer portraits.

Diversity of data sources is a basic characteristic of big data. The reliability of the data sources, and of the data within them, often differs. Although big data analysis methods and open algorithm libraries are numerous, the problem of fusing data from sources of differing credibility has not yet been solved.

In view of this, a data processing method, a data processing apparatus, a storage medium, and an electronic device are required.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present invention and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.

Disclosure of Invention

The invention aims to provide a data processing method, a data processing device, a storage medium and an electronic device that overcome, at least to some extent, the problem of inaccurate data fusion caused by the diversity of data sources and the differing credibility of data from different data sources.

According to an aspect of the present invention, there is provided a data processing method including: acquiring the comprehensive confidence coefficient of the same field in each data source according to the data source confidence coefficient of each data source of the target object and the field confidence coefficient of the same field in each data source; and obtaining the fusion confidence coefficient of the same field in each data source according to the comprehensive confidence coefficient of the same field in each data source and the similarity of the same field among the data sources.

Optionally, obtaining a fusion confidence of the same field in each data source according to the comprehensive confidence of the same field in each data source and the similarity of the same field between each data source, includes: determining the same field of a first data source in each data source as a first reference field; calculating the similarity between the same field in each data source and the first reference field; and obtaining the fusion confidence coefficient of the first reference field according to the comprehensive confidence coefficient of the same field in each data source and the similarity between the same field in each data source and the first reference field.

Optionally, determining the same field of a first data source in the data sources as the first reference field includes: selecting, as the first data source, the data source in which the same field has the highest comprehensive confidence among all the data sources, and using the same field of the first data source as the first reference field.

Optionally, obtaining a fusion confidence of the first reference field according to the comprehensive confidence of the same field in each data source and the similarity between the same field in each data source and the first reference field, includes: weighting and summing the comprehensive confidence of the same field in each data source and the similarity between the same field in each data source and the first reference field to obtain the fusion confidence of the first reference field.

Optionally, obtaining a comprehensive confidence of the same field in each data source according to the data source confidence of each data source of the target object and the field confidence of the same field in each data source, includes: taking the product of the data source confidence of each data source and the field confidence of the same field in the corresponding data source as the comprehensive confidence of the same field in each data source.

Optionally, the method further comprises: acquiring a plurality of data sources comprising the related information of the target object according to the unique identifier of the target object; obtaining data source confidence of each data source; a field confidence for each field in each data source is obtained.

Optionally, the method further comprises: normalizing the data source confidence of each data source; and/or normalizing the field confidence for each field in each data source.

According to an aspect of the present invention, there is provided a data processing apparatus comprising: the comprehensive confidence coefficient acquisition module is configured to acquire the comprehensive confidence coefficient of the same field in each data source according to the data source confidence coefficient of each data source of the target object and the field confidence coefficient of the same field in each data source; and the fusion confidence coefficient acquisition module is configured to acquire the fusion confidence coefficient of the same field in each data source according to the comprehensive confidence coefficient of the same field in each data source and the similarity of the same field among the data sources.

Optionally, the fusion confidence obtaining module includes: the reference field determining unit is configured to determine the same field of a first data source in the data sources as a first reference field; the similarity calculation unit is configured to calculate the similarity between the same field in each data source and the first reference field; and the fusion confidence coefficient acquisition unit is configured to acquire the fusion confidence coefficient of the first reference field according to the comprehensive confidence coefficient of the same field in each data source and the similarity between the same field in each data source and the first reference field.

Optionally, the reference field determining unit includes: a reference field determining subunit configured to select, as the first data source, the data source in which the same field has the highest comprehensive confidence among the data sources, and to use the same field of the first data source as the first reference field.

Optionally, the fusion confidence obtaining unit includes: a fusion confidence obtaining subunit configured to perform weighted summation on the comprehensive confidence of the same field in each data source and the similarity between the same field in each data source and the first reference field to obtain the fusion confidence of the first reference field.

Optionally, the comprehensive confidence obtaining module includes: a comprehensive confidence acquiring unit configured to take the product of the data source confidence of each data source and the field confidence of the same field in the corresponding data source as the comprehensive confidence of the same field in each data source.

Optionally, the apparatus further comprises: the data source acquisition module is configured to acquire a plurality of data sources comprising the related information of the target object according to the unique identifier of the target object; the data source confidence coefficient obtaining module is configured to obtain the data source confidence coefficient of each data source; a field confidence obtaining module configured to obtain a field confidence for each field in each data source.

Optionally, the apparatus further comprises: a normalization module configured to: normalizing the data source confidence of each data source; and/or normalizing the field confidence for each field in each data source.

According to an aspect of the present invention, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements a data processing method as described in any one of the above.

According to an aspect of the present invention, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any one of the data processing methods described above via execution of the executable instructions.

In the technical solutions provided by some embodiments of the present invention, a comprehensive confidence of the same field in each data source is obtained according to the data source confidence of each data source of the target object and the field confidence of the same field in each data source, and further, a fusion confidence of the same field in each data source is obtained according to the comprehensive confidence of the same field in each data source and the similarity of the same field between each data source, so that reliability evaluation of the same field in different data sources can be achieved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:

FIG. 1 schematically illustrates a flow chart of a data processing method according to an exemplary embodiment of the present invention;

FIG. 2 schematically shows a flow chart of an exemplary embodiment of step S120;

FIG. 3 schematically illustrates a flow chart of another data processing method according to an exemplary embodiment of the present invention;

FIG. 4 is a diagram that schematically illustrates data source confidence for data sources, in accordance with an illustrative embodiment of the present invention;

FIG. 5 is a diagram that schematically illustrates field confidence for the same field in various data sources, in accordance with an exemplary embodiment of the present invention;

FIG. 6 is a diagram that schematically illustrates the comprehensive confidence of the same field in various data sources, in accordance with an exemplary embodiment of the present invention;

FIG. 7 schematically illustrates a diagram of a first reference field according to an exemplary embodiment of the present invention;

FIG. 8 is a diagram that schematically illustrates the similarity of identical fields among data sources, in accordance with an illustrative embodiment of the present invention;

FIG. 9 is a diagram that schematically illustrates a fused confidence for a first reference field in a first data source, in accordance with an exemplary embodiment of the present invention;

FIG. 10 schematically shows a block diagram of a data processing apparatus according to an exemplary embodiment of the present invention;

FIG. 11 schematically shows a block diagram of an electronic device according to an exemplary embodiment of the present invention.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the invention.

Furthermore, the drawings are merely schematic illustrations of the invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the steps. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

Fig. 1 schematically shows a flow chart of a data processing method according to an exemplary embodiment of the present invention.

As shown in fig. 1, the data processing method provided by the embodiment of the present invention may include the following steps.

In step S110, a comprehensive confidence of the same field in each data source is obtained according to the data source confidence of each data source of the target object and the field confidence of the same field in each data source.

In an exemplary embodiment, the method may further include: acquiring a plurality of data sources comprising the related information of the target object according to the unique identifier of the target object; obtaining the data source confidence of each data source; and obtaining the field confidence of each field in each data source.

In the embodiment of the present invention, the target object may be determined according to the application scenario. For example, when the method is applied to customer portraits at an insurance company, the target object may be an insurance customer or a potential insurance customer; when the method is applied to citizen portraits in a public security system, the target object may be a citizen; and so on.

In the embodiment of the present invention, the unique identifier of the target object may be any one or more of a mobile phone number, an identification number, and the like of the target object, as long as the unique identifier can uniquely distinguish each target object.

In the embodiment of the present invention, the target object related information may be any information related to the target object, for example, the address, name, bank card number, online shopping behavior data, place of residence, or work unit of the target object; the present invention is not limited thereto.

In this embodiment of the present invention, the plurality of data sources may be, for example: data sources from different sensing devices, data sources from different institutions including customer activity data and/or attribute data, etc.

In the embodiment of the present invention, the credibility of different data sources in the plurality of data sources is different, and the credibility of different fields in the same data source may also be different. For example, the name information of the target object in the household registration data source of the public security department is relatively accurate, but the address information in that data source may be less accurate than the data in an express delivery company's data source. To help data analysts handle such data, the embodiment of the invention assigns a data source confidence to each data source and a field confidence to each field in each data source.

In an exemplary embodiment, the method may further include: normalizing the data source confidence of each data source; and/or normalizing the field confidence for each field in each data source.

In an exemplary embodiment, obtaining the comprehensive confidence of the same field in each data source according to the data source confidence of each data source of the target object and the field confidence of the same field in each data source may include: taking the product of the data source confidence of each data source and the field confidence of the same field in the corresponding data source as the comprehensive confidence of the same field in each data source. However, the present invention is not limited thereto; any manner of obtaining the comprehensive confidence of the same field in each data source from the data source confidence of each data source of the target object and the field confidence of the same field in each data source falls within the scope of the present invention.
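The product rule described above can be written compactly. In the following, X_i and Y_i follow the notation the description uses later for the data source confidence and the field confidence of the i-th data source; the symbol C_i for the comprehensive confidence is an assumption introduced here for illustration only.

```latex
C_i = X_i \cdot Y_i, \qquad 1 \le i \le n
```

If both X_i and Y_i lie between 0 and 1, then C_i also lies between 0 and 1.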

In step S120, a fusion confidence of the same field in each data source is obtained according to the comprehensive confidence of the same field in each data source and the similarity of the same field between each data source.

The data processing method provided by the embodiment of the invention is based on the assumption that the data of multiple data sources are not completely credible. It addresses the problems that different data sources differ in credibility and that different fields within different data sources also differ in credibility. The method obtains the comprehensive confidence of the same field in each data source according to the data source confidence of each data source of the target object and the field confidence of the same field in each data source, and then obtains the fusion confidence of the same field in each data source according to that comprehensive confidence and the similarity of the same field between the data sources. This enables reliability evaluation of the same field in different data sources and helps resolve conflicts in user information across multiple data sources. The method can be applied to data fusion of multiple data sources, and the fused data can carry the fusion confidence, so that the fused data can be used for user portraits in various application scenarios, such as insurance, public security, banking, and related fields.

Fig. 2 schematically shows a flow chart of an exemplary embodiment of step S120.

As shown in fig. 2, the step S120 shown in fig. 1 may further include the following steps.

In step S121, the same field of the first data source among the data sources is determined as the first reference field.

In an exemplary embodiment, determining the same field of a first data source among the data sources as the first reference field may include: selecting, as the first data source, the data source in which the same field has the highest comprehensive confidence among all the data sources, and using the same field of the first data source as the first reference field.

For example, the comprehensive confidences of the selected same field in the data sources may be sorted in descending order, the data source with the highest comprehensive confidence may be selected as the first data source, and the selected same field in the first data source may then be used as the first reference field. Selecting only the same field in the first data source with the highest comprehensive confidence as the first reference field greatly reduces the amount of computation, and from a practical viewpoint the content of the field with the highest comprehensive confidence can generally be considered the most likely to be the real content. However, the present invention is not limited to this; for example, the selected same field in each data source may be used as the first reference field in turn, and the fusion confidence of each such first reference field may then be calculated.
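The selection of the first data source can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `pick_first_source` and the dictionary layout (data source name mapped to the comprehensive confidence of the selected field) are assumptions made for the example.

```python
def pick_first_source(comprehensive: dict[str, float]) -> str:
    """Return the name of the data source whose selected field has the
    highest comprehensive confidence (hypothetical helper)."""
    if not comprehensive:
        raise ValueError("at least one data source is required")
    # Sort in descending order of comprehensive confidence, as in the text.
    # sorted() is stable, so ties keep their original insertion order.
    ranked = sorted(comprehensive.items(), key=lambda item: item[1], reverse=True)
    return ranked[0][0]


# Illustrative values only.
first_source = pick_first_source({"A": 0.12, "B": 0.05, "D": 0.24})
```

The alternative mentioned in the text, treating every data source in turn as the reference, would simply loop over all keys instead of taking the top-ranked one.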

In step S122, the similarity between the same field in each data source and the first reference field is calculated.

In step S123, a fusion confidence of the first reference field is obtained according to the comprehensive confidence of the same field in each data source and the similarity between the same field in each data source and the first reference field.

In an exemplary embodiment, obtaining the fusion confidence of the first reference field according to the comprehensive confidence of the same field in each data source and the similarity between the same field in each data source and the first reference field may include: weighting and summing the comprehensive confidence of the same field in each data source and the similarity between the same field in each data source and the first reference field to obtain the fusion confidence of the first reference field.
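One possible reading of the weighted summation is sketched below: each data source contributes its comprehensive confidence multiplied by the similarity between its field value and the first reference field. The text at this point does not fix the exact weights or the similarity measure, so both are assumptions here; `difflib.SequenceMatcher` from the Python standard library is used only as a stand-in string similarity, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the source does not prescribe a measure."""
    return SequenceMatcher(None, a, b).ratio()


def fusion_confidence(reference_value: str,
                      values: dict[str, str],
                      comprehensive: dict[str, float]) -> float:
    """Sum, over all data sources, of comprehensive confidence times the
    similarity between that source's field value and the reference value."""
    total = 0.0
    for source, value in values.items():
        total += comprehensive[source] * similarity(value, reference_value)
    return total


# Illustrative values only: A and B agree with the reference, C does not.
values = {"A": "abc", "B": "abc", "C": "xyz"}
comprehensive = {"A": 0.5, "B": 0.25, "C": 0.25}
score = fusion_confidence(values["A"], values, comprehensive)
```

Under this reading, sources that agree with the reference field raise its fusion confidence in proportion to their own comprehensive confidence, while sources with dissimilar values contribute little or nothing.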

The data processing method provided by the embodiment of the invention starts from the assumption that the data of multiple data sources are not completely credible and takes into account that value density and credibility are not uniform across data sources. It thereby provides a method for fusion analysis of multi-source big data that makes data analysis more rigorous and its conclusions closer to the facts, and is a data fusion method that provides useful reference values for evaluation.

Fig. 3 schematically shows a flow chart of another data processing method according to an exemplary embodiment of the present invention.

As shown in fig. 3, the data processing method provided by the embodiment of the present invention may include the following steps.

In step S310, a plurality of data sources including target object related information are acquired.

For example, several attribute data source tables for a target object are selected based on the unique identification of the target object.
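As a rough sketch of this selection step, assume each data source table is a list of record dictionaries and that every record stores the target object's unique identifier under an `"id"` key. The table layout, the key name, and the function name `collect_sources` are all assumptions made for illustration.

```python
def collect_sources(unique_id: str,
                    tables: dict[str, list[dict]]) -> dict[str, list[dict]]:
    """Keep only the tables that contain records for the given target object,
    and within each table keep only that object's records (hypothetical helper)."""
    selected = {}
    for name, records in tables.items():
        matches = [record for record in records if record.get("id") == unique_id]
        if matches:
            selected[name] = matches
    return selected


# Illustrative data only.
tables = {
    "A": [{"id": "u1", "NAME": "Zhang San"}, {"id": "u2", "NAME": "Li Si"}],
    "B": [{"id": "u2", "NAME": "Li Si"}],
}
sources_for_u1 = collect_sources("u1", tables)
```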

In step S320, a data source confidence of each data source is obtained.

For example, the confidence evaluation is performed on the different data source tables of the selected target object.

Before confidence analysis of data from different data sources, each data source to be processed can be given a credibility score, which serves as its data source confidence and identifies how credible the data from that source is.

In the embodiment of the present invention, the value range of the data source confidence of each data source may be limited to 0 to 1, but the present invention is not limited thereto, and in other embodiments, the value range of the data source confidence of each data source may also be autonomously set according to actual requirements.

In the embodiment of the invention, the credibility scores of different data sources are assigned mainly on the basis of experience. Relevant considerations may include the business objective, the assessed credibility of each data source, the assessed credibility of each data field in each data source, the rule by which the credibility of each data source's data decays over time, and the like.

For example, a public security data source and a bank data source can be preferred as sources of attribute data of the target object that is stable over the long term, and an express delivery data source can be preferred as a source of attribute data that tends to change in the short term. The credibility scores of the data sources can also be adjusted through feedback from actual results.

In the embodiment of the invention, the credibility scoring of the data sources can adopt either of two modes.

The first mode scores the data sources relatively independently: the data source confidence of each single data source is between 0 and 1, and the sum of the data source confidences of multiple data sources can exceed 1.

The second mode scores the data sources jointly: the data source confidence of each single data source is between 0 and 1, and the sum of the data source confidences of the multiple data sources is also between 0 and 1. The second mode can be implemented on the basis of the first mode.

For example, the joint credibility scoring of the data source confidences in the second mode may include the following steps:

1) giving each data source a relatively independent credibility score, denoted X₁, X₂, ..., X_n, which respectively represent the data source confidences of the n data sources of the target object, where n is a positive integer greater than or equal to 1;

2) calculating a joint credibility score for each data source; for example, the data source confidence may be normalized by the following formula (1) or formula (2).

For example, the normalized data source confidence score(X_i) of the i-th data source of the n data sources, where 1 ≤ i ≤ n, can be calculated using formula (1):

[Formula (1) is not reproduced in this text.]

As another example, the normalized data source confidence score(X_i) of the i-th data source of the n data sources, where 1 ≤ i ≤ n, may be calculated using formula (2):

[Formula (2) is not reproduced in this text.]

In formula (2), max and min represent the maximum and minimum operations, respectively.

It should be noted that the present invention is not limited to the normalization operation of the data source confidence level only by using the above formula (1) or (2), and any other suitable normalization method may be used.
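Formulas (1) and (2) themselves are not reproduced in this text, so the sketch below shows two common normalization schemes rather than the patent's exact formulas. Division by the sum makes the normalized confidences add up to 1, which matches the joint scoring requirement described above; min-max scaling uses the maximum and minimum operations, but whether it coincides with formula (2) is an assumption. Both function names are hypothetical.

```python
def normalize_by_sum(scores: list[float]) -> list[float]:
    """Divide each score by the total so the results sum to 1."""
    total = sum(scores)
    if total == 0:
        raise ValueError("scores must not sum to zero")
    return [score / total for score in scores]


def normalize_min_max(scores: list[float]) -> list[float]:
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(scores), max(scores)
    if high == low:
        raise ValueError("scores must not all be equal")
    return [(score - low) / (high - low) for score in scores]


# Illustrative independent scores for five data sources.
independent = [0.6, 0.5, 0.4, 0.3, 0.2]
joint_by_sum = normalize_by_sum(independent)
joint_min_max = normalize_min_max(independent)
```

Note that min-max scaling keeps each value between 0 and 1 but does not by itself keep the sum between 0 and 1, so a further rescaling would be needed if that property is required.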

For example, suppose that the target object has five data sources A, B, C, D, and E, and that a different data source confidence is assigned to each of them. As shown in FIG. 4, the data source confidence of data source A is 0.6, that of data source B is 0.5, that of data source C is 0.4, that of data source D is 0.3, and that of data source E is 0.2. In this example the data source confidences are not normalized, but in other embodiments the data source confidence of each data source may be normalized.

In step S330, field confidence of each field of each data source is obtained.

In the embodiment of the invention, field confidence evaluation can be performed on the fields in different data sources according to the objective to be accomplished.

In the embodiment of the invention, before fusion confidence analysis is carried out on data of different data sources, credibility marking and scoring can be carried out on each key field under different data sources to be processed so as to identify the credibility of different fields in different data sources.

In this embodiment of the present invention, the key field may be, for example, a name field, an address field, an age field, and the like of the target object.

In the embodiment of the present invention, the value range of the field confidence of different fields of each data source may be limited to 0 to 1, but the present invention is not limited thereto, and in other embodiments, the value range of the field confidence of different fields of each data source may also be set autonomously according to actual requirements.

In the embodiment of the invention, in addition to the data source confidence evaluation on the data source, the field confidence evaluation is also carried out on each key field in each data source, and the initial evaluation of the field confidence is mainly based on experience.

For example, the name field usually has a higher confidence in a public security department data source than in an express delivery data source, the contact information field usually has a lower confidence in an entry-exit data source than in a bank data source, and so on. In addition, the field confidences can be evaluated and adjusted based on feedback from actual business use.

Similarly, in the embodiment of the present invention, the field confidence may also be evaluated in an independent evaluation manner and a joint evaluation manner, where the independent evaluation of the field confidence is similar to the independent scoring of the data source confidence of each data source in step S320, and the joint evaluation of the field confidence is similar to the joint scoring of the data source confidence of each data source in step S320.

For example, a joint credibility score for achieving field confidence among data sources using a joint evaluation approach may include the following steps:

1) giving a relatively independent confidence score, denoted Y, for the same selected field of each data source₁，Y₂，...，Y_nField confidence levels of the same selected field of n data sources of the target object are respectively represented, wherein n is a positive integer greater than or equal to 1;

2) a joint credibility score for the field confidence of the selected same field of each data source is calculated, for example, normalization of the data source confidence may be achieved by the following formula (3) or formula (4).

For example, the following formula (3) may be used to calculate the normalized field confidence score $Y_i$ of the selected same field in the ith of the n data sources, where $1 \le i \le n$:

For another example, the following formula (4) may be used to calculate the normalized field confidence score $Y_i$ of the selected same field in the ith of the n data sources, where $1 \le i \le n$:

In the above formula (4), max and min represent the maximum and minimum operations, respectively.

It should be noted that the present invention is not limited to normalizing the field confidence only by the above formula (3) or (4), and any other suitable normalization method may be used.
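Since formulas (3) and (4) are not reproduced in the text above, the following is only a minimal sketch under stated assumptions: it assumes that one normalization divides each score by the sum of all scores, and that the other is a min-max scaling built from the max and min operations mentioned for formula (4). The function names and the raw scores are hypothetical and are not taken from the specification.

```python
from typing import List


def normalize_by_sum(scores: List[float]) -> List[float]:
    """Scale scores so that they sum to 1 (assumed form, not from the source)."""
    total = sum(scores)
    if total == 0:
        raise ValueError("scores must not sum to zero")
    return [y / total for y in scores]


def normalize_min_max(scores: List[float]) -> List[float]:
    """Map scores onto [0, 1] using the minimum and maximum of the list."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All scores are equal, so min-max scaling is undefined; return zeros.
        return [0.0 for _ in scores]
    return [(y - lowest) / (highest - lowest) for y in scores]


# Hypothetical raw field confidence scores Y_1..Y_5 for one selected field.
raw_field_scores = [2.0, 1.0, 4.0, 8.0, 3.0]
by_sum = normalize_by_sum(raw_field_scores)
by_min_max = normalize_min_max(raw_field_scores)
```

Either variant keeps the relative ordering of the data sources; the min-max variant additionally forces the weakest source to 0 and the strongest to 1, which may or may not be desirable for a given service.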

For example, the NAME fields under the above A, B, C, D, E data sources are respectively selected, and different field confidence degrees are given to the NAME fields in each data source, as shown in fig. 5, the confidence degree of the NAME field in the data source a is 0.2, the confidence degree of the NAME field in the data source B is 0.1, the confidence degree of the NAME field in the data source C is 0.4, the confidence degree of the NAME field in the data source D is 0.8, and the confidence degree of the NAME field in the data source E is 0.3.

In step S340, the integrated confidence of the same selected field in each data source is obtained.

In the embodiment of the invention, when data fusion confidence coefficient analysis of different data sources is carried out, comprehensive confidence coefficient scoring can be carried out on each key field under different data sources so as to mark the credibility of the selected same field under the corresponding data source.

In the embodiment of the present invention, the range of the value of the integrated confidence of the same selected field in each data source may be between 0 and 1, but the present invention is not limited thereto.

In reality, for the same field selected in different data sources, the content of the same field in different data sources may be inconsistent, for example, the name field, and the names in different data sources may be different.

In the embodiment of the present invention, after step S320 and step S330 and before fusion confidence analysis is performed on the same selected field in each data source, a comprehensive confidence score of each field in the different data sources needs to be calculated. The score may be obtained as the product of the data source confidence and the field confidence of the selected field in the corresponding data source.

For example, again taking the above-mentioned A, B, C, D, E data sources as an example and selecting the NAME field, multiplying the field confidence of the NAME field of the target object by the data source confidence of the corresponding data source gives the comprehensive confidence of the NAME field used for fusion analysis. As shown in fig. 6, the comprehensive confidence of the NAME field in the data source A is 0.6 × 0.2 = 0.12, the comprehensive confidence of the NAME field in the data source B is 0.5 × 0.1 = 0.05, the comprehensive confidence of the NAME field in the data source C is 0.4 × 0.4 = 0.16, the comprehensive confidence of the NAME field in the data source D is 0.3 × 0.8 = 0.24, and the comprehensive confidence of the NAME field in the data source E is 0.2 × 0.3 = 0.06.
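The product described above can be sketched as follows. This is a minimal illustration using the data source confidences and NAME field confidences from the example; the function and variable names are hypothetical.

```python
from typing import Dict

# Confidences of data sources A..E and of their NAME fields, from the example.
source_confidence = {"A": 0.6, "B": 0.5, "C": 0.4, "D": 0.3, "E": 0.2}
name_field_confidence = {"A": 0.2, "B": 0.1, "C": 0.4, "D": 0.8, "E": 0.3}


def comprehensive_confidence(
    source_conf: Dict[str, float], field_conf: Dict[str, float]
) -> Dict[str, float]:
    """Multiply each data source confidence by the field confidence in that source."""
    return {src: source_conf[src] * field_conf[src] for src in source_conf}


comprehensive = comprehensive_confidence(source_confidence, name_field_confidence)
```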

In step S350, the similarity of the same field between different data sources is obtained.

In the embodiment of the present invention, a first reference field for calculating the similarity may be determined first.

In this embodiment of the present invention, for the same field in different data sources, the comprehensive confidence of the same field may be calculated according to the manner in step S340, and the comprehensive confidence of the same field of each data source may be arranged in an ascending order or a descending order, and then the same field in the data source with the highest comprehensive confidence is selected as the first reference field.

In other embodiments, the calculation and selection of the first reference field may be performed for all fields of the data sources that need to participate in the fusion analysis.

For example, taking the above five data sources A, B, C, D, E as an example and selecting the NAME field, the data source confidence of each data source is multiplied by the field confidence of the NAME field of the target object in that data source to obtain the comprehensive confidence of the NAME field for fusion analysis, which corresponds to the last row in fig. 7. The field with the highest comprehensive confidence is then selected as the first reference field for the subsequent similarity calculation; in fig. 7, the underlined NAME field content JACK BRON in the data source D is the currently selected first reference field.
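Selecting the first reference field amounts to taking the data source with the largest comprehensive confidence. A minimal sketch, using the values of the example and hypothetical variable names:

```python
# Comprehensive confidences and NAME field contents of data sources A..E.
comprehensive = {"A": 0.12, "B": 0.05, "C": 0.16, "D": 0.24, "E": 0.06}
name_content = {
    "A": "Jack Bron",
    "B": "Jack",
    "C": "Lucy",
    "D": "JACK BRON",
    "E": "Lucy Bron",
}

# The data source whose field has the highest comprehensive confidence
# supplies the first reference field.
reference_source = max(comprehensive, key=comprehensive.get)
reference_content = name_content[reference_source]
```

If two data sources tie for the highest comprehensive confidence, `max` returns the first one encountered; how ties should be broken is not specified in the description.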

Then, for the same field selected in each data source, the similarity evaluation of the same field among different data sources is performed one by one to find the most reliable field content.

In the embodiment of the present invention, the range of the similarity of the same field between different data sources may be between 0 and 1, but the present invention is not limited thereto.

Because the number of fields involved in big data analysis is usually large, in order to avoid that the calculation of the cross similarity of multiple fields takes too long, in the embodiment of the present invention, the first reference field with the highest comprehensive confidence may be used as a base, and then the similarity scores of the base field to the same fields in other data sources may be calculated respectively.

In the embodiment of the present invention, the way of calculating the similarity score of the base field to the same field in other data sources differs according to the difference of the field contents.

For example, if the content of the currently selected field is a numerical value, then the euclidean distance may be used as a measure of similarity of the selected field between different data sources; for another example, if the currently selected field content is a string, then string matching may be used as a measure of similarity of the selected field between different data sources; as another example, if the currently selected field content is a segmented string, then segmented string matching may be used as a measure of similarity for the selected field between different data sources; for another example, if the currently selected field content is a multi-type hybrid character string, such as a JSON string, then the corresponding similarity measure method may be used to perform the calculation according to the type of the parsed key-value content.
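The choice of similarity measure by content type might be sketched as below. This is an illustration under assumptions, not the measures of the embodiment: the mapping of a Euclidean distance d to the score 1 / (1 + d), the use of `difflib.SequenceMatcher` for string matching, and the Jaccard overlap of space-separated segments are all choices made here for the sake of a runnable example, and all function names are hypothetical. The scores produced are unsigned (between 0 and 1).

```python
import json
from difflib import SequenceMatcher
from typing import Any


def numeric_similarity(a: float, b: float) -> float:
    # Assumption: turn the one-dimensional Euclidean distance into a score in (0, 1].
    return 1.0 / (1.0 + abs(a - b))


def string_similarity(a: str, b: str) -> float:
    # Case-insensitive string matching ratio in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def segmented_similarity(a: str, b: str, sep: str = " ") -> float:
    # Jaccard overlap of the case-folded segments of both strings.
    seg_a, seg_b = set(a.lower().split(sep)), set(b.lower().split(sep))
    if not seg_a and not seg_b:
        return 1.0
    return len(seg_a & seg_b) / len(seg_a | seg_b)


def field_similarity(a: Any, b: Any) -> float:
    """Pick a similarity measure according to the type of the field content."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return numeric_similarity(a, b)
    if isinstance(a, dict) and isinstance(b, dict):
        # Parsed key-value content: average the per-key similarities.
        keys = set(a) | set(b)
        if not keys:
            return 1.0
        total = sum(
            field_similarity(a[k], b[k]) if k in a and k in b else 0.0 for k in keys
        )
        return total / len(keys)
    if isinstance(a, str) and isinstance(b, str):
        if " " in a or " " in b:
            return segmented_similarity(a, b)
        return string_similarity(a, b)
    return 0.0  # contents of different or unsupported types


def json_similarity(a: str, b: str) -> float:
    # Parse both JSON strings, then compare the parsed content by type.
    return field_similarity(json.loads(a), json.loads(b))
```

For instance, `field_similarity("Jack", "JACK BRON")` compares the segment sets {jack} and {jack, bron} and yields 0.5.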

For example, taking the above A, B, C, D, E data sources as an example, the NAME field is selected, and the NAME field content JACK BRON in the data source D, which has the highest comprehensive confidence, is selected as the first reference field. The similarities between the first reference field and the same field in the other data sources are then calculated respectively, as shown by the dotted lines in fig. 8, where the number marked on each dotted line is the similarity between the corresponding data source and the first reference field, a positive sign represents a positive correlation, and a negative sign represents a negative correlation. As shown in fig. 8, the similarity between the NAME field content Jack Bron in data source A and the NAME field content JACK BRON in data source D is 1, the similarity between the NAME field content Jack in data source B and the NAME field content JACK BRON in data source D is 0.5, the similarity between the NAME field content Lucy in data source C and the NAME field content JACK BRON in data source D is -1, and the similarity between the NAME field content Lucy Bron in data source E and the NAME field content JACK BRON in data source D is -0.5. The similarity of the NAME field content JACK BRON in data source D to itself is 1.

It should be noted that, in the embodiment shown in fig. 8, first, the similarity of the NAME field between each other data source and the data source D is calculated by treating the NAME field as a character string and ignoring letter case, although for other fields the upper and lower case of the same letter may be treated as different; second, as a general rule, names with the same surname are considered positively correlated, while names with different surnames are considered negatively correlated. However, this example is only used to illustrate the similarity calculation and does not limit the similarity calculation method. In other embodiments, the similarity may be calculated in any other suitable manner; for example, the similarity between different character strings may be calculated by using cosine similarity.
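As one possible reading of the cosine similarity mentioned above, the sketch below compares two strings through their character bigram counts, ignoring letter case. The bigram representation and the function names are assumptions made for illustration. Because the count vectors are non-negative, this variant only produces scores between 0 and 1 and does not by itself express the negative correlation shown in fig. 8.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count the character n-grams of the case-folded text."""
    normalized = text.lower()
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 2) -> float:
    """Cosine of the angle between the n-gram count vectors of two strings."""
    vec_a, vec_b = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * vec_b[gram] for gram, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a string shorter than n characters has no n-grams
    return dot / (norm_a * norm_b)
```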

In step S360, a fusion confidence of the same field is obtained according to the comprehensive confidence and the similarity.

In the embodiment of the invention, the calculated similarity and the comprehensive confidence degree of the same field in each data source are combined to calculate the fusion confidence degree of the selected same field, and the most possible field content of the field can be obtained according to the fusion confidence degrees of the same field in different data sources for data result evaluation.

In the embodiment of the present invention, the range of the fusion confidence of the same field selected in each data source may be between 0 and 1, but the present invention is not limited thereto.

In the embodiment of the invention, after the comprehensive confidence scores and the similarity scores of the different fields in the plurality of data sources are obtained, the fusion confidence score that each field presents externally as a whole after multi-source data fusion is calculated. Because data quality differs between data sources, the data fields are positively or negatively correlated with one another to a stronger or weaker degree, where the strength of the correlation can be described by the comprehensive confidence and the similarity, and whether the correlation is positive or negative can be described by the sign of the similarity.

In the embodiment of the invention, after the comprehensive confidence and the similarity of each field are obtained, the content of each field can be subjected to fusion diagnosis and analysis by using a weighted summation mode.

For example, assume that the target object has n data sources $A_1, A_2, \dots, A_n$ in total, where n is a positive integer greater than or equal to 1, that a common field M is selected from each data source, and that the comprehensive confidence scores of the common field M in the n data sources are $a_1, a_2, \dots, a_n$, respectively. Assume that the M field in the jth ($1 \le j \le n$) data source is the first reference field, denoted as:

Further assume that the similarity $s_j$ of the M field in the jth data source to the M fields in the other data sources is expressed in vector form as:

$$s_j = (s_{j1}, s_{j2}, \dots, s_{ji}, \dots, s_{jn}) \tag{6}$$

where i indexes the M field in the ith data source, and when $i = j$, the similarity of the field to itself is 1.

The multi-source data fusion confidence $L_j$ based on the jth first reference field can be expressed by the following formula:

$$L_j = s_{j1} a_1 + s_{j2} a_2 + \dots + s_{jn} a_n \tag{7}$$

For example, again taking the above-mentioned A, B, C, D, E data sources as an example and selecting the NAME field, the similarity scores relative to the first reference field NAME = JACK BRON are shown by the dashed lines in fig. 9, and the comprehensive confidence scores are shown in the second-to-last row of fig. 9. The fusion confidence score of the reference field is then 0.12 × 1 + 0.05 × 0.5 − 0.16 × 1 + 0.24 × 1 − 0.06 × 0.5 = 0.195, i.e., the fusion confidence that the NAME field is JACK BRON is 0.195.
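The weighted summation of formula (7) can be checked against this example with a short sketch. The similarities and comprehensive confidences are those given for data sources A to E; the function name is hypothetical.

```python
from typing import Sequence


def fusion_confidence(
    similarities: Sequence[float], confidences: Sequence[float]
) -> float:
    """Weighted sum of comprehensive confidences, weighted by signed similarity."""
    if len(similarities) != len(confidences):
        raise ValueError("one similarity is required per data source")
    return sum(s * a for s, a in zip(similarities, confidences))


# Values for data sources A, B, C, D, E with D as the first reference field.
comprehensive = [0.12, 0.05, 0.16, 0.24, 0.06]
similarity_to_d = [1.0, 0.5, -1.0, 1.0, -0.5]

fused = fusion_confidence(similarity_to_d, comprehensive)
```

Data sources that agree with the reference field raise the score in proportion to their own comprehensive confidence, while data sources with a negative similarity lower it.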

Similarly, the NAME field in the data source A may also be selected as a second reference field. The similarities between the NAME fields in the data source B, the data source C, the data source D, and the data source E and the second reference field are then calculated respectively, and the fusion confidence of the NAME field in the data source A is obtained by weighted summation of the comprehensive confidences of the NAME fields in the data sources and the corresponding similarities. The fusion confidence of the NAME field in each of the other data sources is calculated in a similar way.
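Repeating the calculation with each data source in turn as the reference might look like the following sketch. The similarity function used here, a case-insensitive overlap of name tokens, is a stand-in chosen only to make the example runnable; it yields unsigned scores, so the resulting numbers differ from the signed values of fig. 8 and fig. 9. All names are hypothetical.

```python
from typing import Callable, Dict


def token_overlap(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard overlap of case-folded name tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def fusion_for_each_reference(
    contents: Dict[str, str],
    comprehensive: Dict[str, float],
    similarity: Callable[[str, str], float],
) -> Dict[str, float]:
    """Use every data source in turn as the reference field and apply formula (7)."""
    result = {}
    for ref, ref_content in contents.items():
        result[ref] = sum(
            similarity(ref_content, contents[src]) * comprehensive[src]
            for src in contents
        )
    return result


name_content = {
    "A": "Jack Bron", "B": "Jack", "C": "Lucy", "D": "JACK BRON", "E": "Lucy Bron",
}
comprehensive = {"A": 0.12, "B": 0.05, "C": 0.16, "D": 0.24, "E": 0.06}
fusion = fusion_for_each_reference(name_content, comprehensive, token_overlap)
```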

In the embodiment of the present invention, the method may further include: after fusion confidence evaluation is performed on all fields of each data source, constructing an attribute data table of the fused target object.

In the embodiment of the present invention, the method may further include: performing statistical analysis based on the attribute data table of the fused target object, where the statistical analysis result carries a fusion confidence parameter.
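One way to picture the fused attribute data table is sketched below. The layout of the table, the rule of keeping the candidate content with the highest fusion confidence, and the averaged confidence used as a statistic are assumptions for illustration. Apart from the value 0.195 for JACK BRON, all field names, contents, and scores are hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Fusion confidence of each candidate content per field (mostly hypothetical).
fusion_scores = {
    "NAME": {"JACK BRON": 0.195, "Lucy": 0.05},
    "PHONE": {"555-0100": 0.31, "555-0199": 0.12},
}


def build_attribute_table(scores: Dict[str, Dict[str, float]]) -> List[dict]:
    """Keep, for every field, the content with the highest fusion confidence."""
    table = []
    for field, candidates in scores.items():
        value, confidence = max(candidates.items(), key=lambda item: item[1])
        table.append(
            {"field": field, "value": value, "fusion_confidence": confidence}
        )
    return table


attribute_table = build_attribute_table(fusion_scores)

# A simple statistic that carries the fusion confidence along with the data.
average_confidence = mean(row["fusion_confidence"] for row in attribute_table)
```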

The data processing method provided by the embodiment of the invention addresses the problem in big data fusion analysis that data sources are diverse and that data sources and data contents differ in reliability. When fusion analysis of multiple data sources is performed, confidence evaluation is performed on the data sources and the fields, the comprehensive confidence is obtained from the data source confidence and the field confidence, and the fusion confidence is obtained from the comprehensive confidence and the similarity, so that the data analysis result carries a fusion confidence score. This can solve the technical problems of data conflicts among data sources and uneven distribution of confidence density in multi-source data, and can improve the accuracy of user portrait information in various fields, such as the financial insurance field and the public security field, for citizen identity portraits, behavior portraits, public opinion situation prediction, security situation prediction, and the like.

It should be noted that although the steps of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

Further, the present exemplary embodiment also provides a data processing apparatus.

Fig. 10 schematically shows a block diagram of a data processing device according to an exemplary embodiment of the present invention. Referring to fig. 10, the data processing apparatus 1000 according to an exemplary embodiment of the present invention may include an integrated confidence obtaining module 1010 and a fused confidence obtaining module 1020.

Specifically, the integrated confidence obtaining module 1010 may be configured to obtain the integrated confidence of the same field in each data source according to the data source confidence of each data source of the target object and the field confidence of the same field in each data source. The fusion confidence obtaining module 1020 may be configured to obtain the fusion confidence of the same field in each data source according to the comprehensive confidence of the same field in each data source and the similarity of the same field between each data source.

In the data processing apparatus according to an exemplary embodiment of the present invention, the fusion confidence obtaining module 1020 may include: a reference field determining unit, which may be configured to determine a same field of a first data source of the data sources as a first reference field; a similarity calculation unit, which may be configured to calculate a similarity between the same field in each data source and the first reference field; a fusion confidence obtaining unit, configured to obtain a fusion confidence of the first reference field according to a comprehensive confidence of the same field in each data source and a similarity between the same field in each data source and the first reference field.

In the data processing apparatus of an exemplary embodiment of the present invention, the reference field determination unit may further include: a reference field determination subunit, where the reference field determination subunit may be configured to select, as the first data source, a data source with a highest comprehensive confidence of a same field in the data sources, and use the same field of the first data source as the first reference field.

In the data processing apparatus of an exemplary embodiment of the present invention, the fusion confidence obtaining unit may include: a fusion confidence obtaining subunit, configured to obtain a fusion confidence of the first reference field by weighted summation of the comprehensive confidence of the same field in each data source and the similarity between the same field in each data source and the first reference field.

In the data processing apparatus according to an exemplary embodiment of the present invention, the integrated confidence obtaining module 1010 may include: a composite confidence acquisition unit, which may be configured to take the product of the data source confidence of each data source and the field confidence of the same field in the corresponding data source as the composite confidence of the same field in each data source.

In the data processing apparatus according to the exemplary embodiment of the present invention, the apparatus 1000 may further include: a data source obtaining module, which may be configured to obtain a plurality of data sources including the information related to the target object according to the unique identifier of the target object; a data source confidence obtaining module, which may be configured to obtain a data source confidence for each data source; a field confidence obtaining module, which may be configured to obtain a field confidence for each field in each data source.

In the data processing apparatus according to the exemplary embodiment of the present invention, the apparatus 1000 may further include: a normalization module, which may be configured to: normalizing the data source confidence of each data source; and/or normalizing the field confidence for each field in each data source.
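The division into modules described above could be mirrored in software roughly as follows. This is only a sketch under assumptions: the class and method names are hypothetical, and the similarity function is a lookup of the values given for fig. 8 rather than a real similarity calculation.

```python
from typing import Callable, Dict, Tuple


class ComprehensiveConfidenceModule:
    """Counterpart of module 1010: product of source and field confidence."""

    def obtain(
        self, source_conf: Dict[str, float], field_conf: Dict[str, float]
    ) -> Dict[str, float]:
        return {src: source_conf[src] * field_conf[src] for src in source_conf}


class FusionConfidenceModule:
    """Counterpart of module 1020: reference selection, similarity, weighted sum."""

    def __init__(self, similarity: Callable[[str, str], float]) -> None:
        self.similarity = similarity

    def obtain(
        self, contents: Dict[str, str], comprehensive: Dict[str, float]
    ) -> Tuple[str, float]:
        # Reference field determining unit: highest comprehensive confidence.
        reference = max(comprehensive, key=comprehensive.get)
        # Similarity calculation unit and fusion confidence obtaining unit.
        fused = sum(
            self.similarity(contents[reference], contents[src]) * comprehensive[src]
            for src in contents
        )
        return reference, fused


# Stand-in similarity: the values given for fig. 8, valid only for JACK BRON.
FIG8_SIMILARITY = {
    "Jack Bron": 1.0, "Jack": 0.5, "Lucy": -1.0, "JACK BRON": 1.0, "Lucy Bron": -0.5,
}


def lookup_similarity(reference: str, other: str) -> float:
    return FIG8_SIMILARITY[other]


contents = {"A": "Jack Bron", "B": "Jack", "C": "Lucy", "D": "JACK BRON", "E": "Lucy Bron"}
comprehensive = ComprehensiveConfidenceModule().obtain(
    {"A": 0.6, "B": 0.5, "C": 0.4, "D": 0.3, "E": 0.2},
    {"A": 0.2, "B": 0.1, "C": 0.4, "D": 0.8, "E": 0.3},
)
reference, fused = FusionConfidenceModule(lookup_similarity).obtain(contents, comprehensive)
```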

Since each functional module of the data processing apparatus according to the embodiment of the present invention corresponds to the steps of the above method embodiment, it is not described herein again.

In an exemplary embodiment of the present invention, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.

The program product for implementing the above method may employ a portable compact disc read-only memory (CD-ROM), include program code, and be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited in this regard, and in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

In an exemplary embodiment of the present invention, there is also provided an electronic device capable of implementing the above method.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."

An electronic device 900 according to this embodiment of the invention is described below with reference to fig. 11. The electronic device 900 shown in fig. 11 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present invention.

As shown in fig. 11, electronic device 900 is embodied in the form of a general purpose computing device. Components of electronic device 900 may include, but are not limited to: the at least one processing unit 910, the at least one storage unit 920, a bus 930 connecting different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.

Wherein the storage unit stores program code that is executable by the processing unit 910 to cause the processing unit 910 to perform steps according to various exemplary embodiments of the present invention described in the above section "exemplary methods" of the present specification. For example, the processing unit 910 may execute step S110 as shown in fig. 1: acquiring the comprehensive confidence coefficient of the same field in each data source according to the data source confidence coefficient of each data source of the target object and the field confidence coefficient of the same field in each data source; step S120: and obtaining the fusion confidence coefficient of the same field in each data source according to the comprehensive confidence coefficient of the same field in each data source and the similarity of the same field among the data sources.

The storage unit 920 may include a readable medium in the form of a volatile storage unit, such as a random access memory unit (RAM)9201 and/or a cache memory unit 9202, and may further include a read only memory unit (ROM) 9203.

Storage unit 920 may also include a program/utility 9204 having a set (at least one) of program modules 9205, such program modules 9205 including but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 930 can be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 900 may also communicate with one or more external devices 1000 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 900 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 950. Also, the electronic device 900 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 960. As shown, the network adapter 960 communicates with the other modules of the electronic device 900 via the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiment of the present invention.

Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the invention. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is only limited by the appended claims.

Claims

1. A data processing method, comprising:

acquiring the comprehensive confidence coefficient of the same field in each data source according to the data source confidence coefficient of each data source of the target object and the field confidence coefficient of the same field in each data source;

selecting a data source with the highest comprehensive confidence coefficient of the same field in all the data sources as a first data source, and using the same field of the first data source as the first reference field;

calculating the similarity between the same field in each data source and the first reference field;

and obtaining the fusion confidence coefficient of the first reference field according to the comprehensive confidence coefficient of the same field in each data source and the similarity between the same field in each data source and the first reference field.

2. The data processing method of claim 1, wherein obtaining the fusion confidence of the first reference field according to the integrated confidence of the same field in each data source and the similarity between the same field in each data source and the first reference field comprises:

and weighting and summing the comprehensive confidence of the same field in each data source and the similarity between the same field in each data source and the first reference field to obtain the fusion confidence of the first reference field.

3. The data processing method of claim 1, wherein obtaining the comprehensive confidence level of the same field in each data source according to the data source confidence level of each data source of the target object and the field confidence level of the same field in each data source comprises:

and taking the product of the data source confidence of each data source and the field confidence of the same field in the corresponding data source as the comprehensive confidence of the same field in each data source.

4. The data processing method of claim 1, further comprising:

acquiring a plurality of data sources comprising the related information of the target object according to the unique identifier of the target object;

obtaining data source confidence of each data source;

a field confidence for each field in each data source is obtained.

5. The data processing method of claim 4, further comprising:

normalizing the data source confidence of each data source; and/or

normalizing the field confidence for each field in each data source.

6. A data processing apparatus, comprising:

the comprehensive confidence coefficient acquisition module is configured to acquire the comprehensive confidence coefficient of the same field in each data source according to the data source confidence coefficient of each data source of the target object and the field confidence coefficient of the same field in each data source;

a fusion confidence coefficient obtaining module configured to select a data source with the highest comprehensive confidence coefficient of the same field in the data sources as a first data source, and the same field of the first data source as the first reference field; calculating the similarity between the same field in each data source and the first reference field; and obtaining the fusion confidence coefficient of the first reference field according to the comprehensive confidence coefficient of the same field in each data source and the similarity between the same field in each data source and the first reference field.

7. A storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the data processing method of any one of claims 1 to 5.

8. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the data processing method of any of claims 1 to 5 via execution of the executable instructions.