CN106503228A

CN106503228A - Data package scarcity assessment method and system

Info

Publication number: CN106503228A
Application number: CN201610970543.5A
Authority: CN
Inventors: 张斌德; 王军; 孙玉权
Original assignee: Guoxin Youe Data Co Ltd
Current assignee: Guoxin Youe Data Co Ltd
Priority date: 2016-10-28
Filing date: 2016-10-28
Publication date: 2017-03-15

Abstract

The present invention provides a data package scarcity assessment method and system. The method comprises the following steps. S100: obtain multiple related data packages related to specified content. S200: determine a data package to be assessed, determine the similarity between it and the other data packages, and select the data packages whose similarity to it is higher than a predetermined threshold as comparison data packages. S300: determine the scarcity of the data package to be assessed using a preset processing method. By assessing the scarcity of a data package, the invention makes it possible to understand the quality of the data package and provides a reference basis for assessing the value of data.

Description

A data package scarcity assessment method and system

Technical field

The present invention relates to the field of big data, and in particular to a data package scarcity assessment method and system.

Background technology

Data trading is currently at an early stage in the industry. It is developing rapidly but lacks mature theoretical guidance. Quantifying the value of data is extremely difficult, which is determined by the essential characteristics of data and by the current business environment. This work is also hindered by numerous objective factors, such as the accurate assessment of data compilation costs, the depreciation and life-cycle changes of data, and the added value of data. As trading in data products becomes increasingly common, how to judge the value of data is a difficulty not only for data sellers but also for buyers.

It is a well-known view that what is rare is valuable, and data is no exception. The scarcer the data, the greater its value. The scarcity of data information resources has two aspects. The first is the root source of scarcity, namely the objective value of data information resources. The second is the form in which scarcity manifests: the usefulness of data information resources makes scarcity possible, and their non-homogeneity makes scarcity inevitable.

Therefore, how to assess the scarcity of data, so as to better serve the data trading market, has become a problem to be solved urgently.

Summary of the invention

In view of the above technical problem, the present invention provides a data package scarcity assessment method and system.

The technical solution adopted by the present invention is as follows.

An embodiment of the present invention provides a data package scarcity assessment method, comprising:

S100: obtaining multiple related data packages related to specified content;

S200: determining a data package to be assessed, determining the similarity between the data package to be assessed and the other data packages, and selecting the data packages whose similarity to the data package to be assessed is higher than a predetermined threshold as comparison data packages;

S300: determining the scarcity of the data package to be assessed using a preset processing method, specifically by the following formula:

f = \frac{2 e^{- y / x}}{1 + e^{- y / x}}

where f is the scarcity score of the data package to be assessed, with a value range of [0, 1]; y is the total number of data records in the data packages other than the data package to be assessed; and x is the number of data records in the data package to be assessed.

Preferably, in step S200 a text similarity algorithm is used to calculate the similarity between the data package to be assessed and the other data packages, specifically comprising:

S210: reading the text of the data package to be assessed and of the comparison data packages into an R language program, splitting the text in each data package into individual words using a word segmentation tool or user-defined segmentation rules, determining feature words and counting the frequency with which each feature word occurs, and building a document-term matrix;

S220: calculating the similarity between the data package to be assessed and each comparison data package based on the following formula:

G = \frac{(N_{1} \times M_{1}) + (N_{2} \times M_{2}) + ... + (N_{m} \times M_{m})}{\sqrt{N_{1}^{2} + N_{2}^{2} + ... + N_{m}^{2}} \times \sqrt{M_{1}^{2} + M_{2}^{2} + ... + M_{m}^{2}}}

where G is the similarity between the data package to be assessed and another data package, with a range of [0, 1]; N_1, N_2, ..., N_m and M_1, M_2, ..., M_m are the numbers of times each feature word occurs in the data package to be assessed and in the other data package, respectively.

Preferably, G greater than 0.5 indicates that the data package to be assessed is similar to the comparison data package, and G greater than 0.85 indicates that the two are highly similar.

Preferably, f = 0 indicates that the data in the data package to be assessed is not scarce, and f = 1 indicates that the data in the data package to be assessed does not exist in the other comparison data packages and is very scarce.

Preferably, the multiple related data packages related to the specified content are obtained by crawling network data from multiple data platforms on the Internet.

Another embodiment of the present invention provides a data package scarcity assessment system, comprising:

A data acquisition module, which obtains multiple related data packages related to specified content; a similarity assessment module, which determines a data package to be assessed, determines the similarity between the data package to be assessed and the other data packages, and selects the data packages whose similarity to the data package to be assessed is higher than a predetermined threshold as comparison data packages; and a scarcity assessment module, which determines the scarcity of the data package to be assessed using a preset processing method, specifically by the following formula:

f = \frac{2 e^{- y / x}}{1 + e^{- y / x}}

Optionally, the similarity assessment module comprises: a feature extraction unit, which determines the feature words of the text in the data package to be assessed and in the comparison data packages using a keyword extraction tool or a user-defined rule; a document-term matrix building unit, which reads the text of the data package to be assessed and of the comparison data packages into an R language program, splits the text in each data package into individual words using a word segmentation tool or user-defined segmentation rules, counts the frequency with which each feature word occurs, and builds a document-term matrix; and a similarity calculation unit, which calculates the similarity between the data package to be assessed and each comparison data package based on the following formula:

G = \frac{(N_{1} \times M_{1}) + (N_{2} \times M_{2}) + ... + (N_{m} \times M_{m})}{\sqrt{N_{1}^{2} + N_{2}^{2} + ... + N_{m}^{2}} \times \sqrt{M_{1}^{2} + M_{2}^{2} + ... + M_{m}^{2}}}

Optionally, the predetermined threshold is 0.5. G greater than 0.5 indicates that the data package to be assessed is similar to the comparison data package, and G greater than 0.85 indicates that the two are highly similar.

Optionally, f = 0 indicates that the data in the data package to be assessed is not scarce, and f = 1 indicates that the data in the data package to be assessed does not exist in the other comparison data packages and is very scarce.

Optionally, the data acquisition module obtains the multiple related data packages related to the specified content by crawling network data from multiple data platforms on the Internet.

By assessing the scarcity of a data package, the present invention makes it possible to understand the quality of the data package and provides a reference basis for assessing the value of data.

Description of the drawings

Fig. 1 is a schematic flowchart of the data package scarcity assessment method provided by an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the data package scarcity assessment system provided by an embodiment of the present invention.

Specific embodiment

Specific embodiments of the present invention are described below with reference to the accompanying drawings.

【Embodiment 1】Data package scarcity assessment method

Fig. 1 is a schematic flowchart of the data package scarcity assessment method provided by an embodiment of the present invention. As shown in Fig. 1, the data package scarcity assessment method provided by this embodiment comprises:

S100: Obtain related data packages

Specifically, based on the specified content, a Python program can be used to crawl the related data packages on the major data trading websites, and the crawled data is stored in a MySQL relational database. A data package may contain files of various data types, such as JSON, picture, video and audio files. The crawling process is as follows: after a user enters a URL, the server host is located through a DNS server and a request is sent to the server; after parsing the request, the server sends HTML, JS, CSS and other files to the user's browser, which parses and renders them. The web page the user sees is therefore essentially composed of HTML code, and this is the content the crawler retrieves. By analyzing and filtering the HTML code, resources such as pictures, text and uploaded attachments can be crawled, so that content related to data packages, such as data package descriptions, can be crawled from the major data trading websites. In this way, multiple related data packages containing the same subject content can be obtained. Of course, data packages that have already been obtained may also be selected for assessment before the assessment operation, rather than being crawled in real time during the assessment operation.
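The filtering of HTML code described above can be sketched in Python, the language this embodiment names for crawling. This is only a minimal sketch, not the crawler of the embodiment: it relies solely on the standard library's `html.parser`, operates on an HTML string that has already been downloaded, and the class name `PackageLinkParser`, the sample markup and the choice to collect links, image sources and visible text are all illustrative assumptions.

```python
from html.parser import HTMLParser


class PackageLinkParser(HTMLParser):
    """Collects hyperlinks, image sources and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []   # href values of <a> tags (e.g. uploaded attachments)
        self.images = []  # src values of <img> tags
        self.text = []    # non-empty text fragments

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


# Hypothetical page fragment standing in for a downloaded listing page.
SAMPLE_HTML = """
<html><body>
  <h1>Import and export data package</h1>
  <p>Customs declaration records by port.</p>
  <a href="/packages/1/sample.json">Sample file</a>
  <img src="/packages/1/preview.png">
</body></html>
"""

parser = PackageLinkParser()
parser.feed(SAMPLE_HTML)
parser.close()

print(parser.links)
print(parser.images)
print(parser.text)
```

A real crawler would additionally need to fetch pages, respect each site's terms, and write the results to the database; those parts are deliberately left out here.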

S200: Calculate the similarity between data packages and select the data packages whose similarity exceeds the predetermined threshold

Specifically, a data package to be assessed can be determined according to the actual situation. For example, if the scarcity of the data of a certain data provision platform needs to be assessed, the data package provided by that platform can be designated as the data package to be assessed. A text similarity algorithm is then used to calculate the similarity between the data package to be assessed and the other data packages, and the data packages whose similarity exceeds the predetermined threshold are selected as comparison data packages. Step S200 may specifically comprise:

S210: The text in the data packages is read into an R language program, and the text in each of the related data packages is split into individual words using a word segmentation tool or user-defined segmentation rules. Feature words are determined, the frequency with which each feature word occurs is counted, and a document-term matrix is built. For example, for data packages on three import and export products, the document-term matrix may be as shown in Table 1 below:

Table 1: Document-term matrix

Feature	Declaration	Export	Port	Province/city	Quantity	Origin	Type	Amount	Specification
Text 1	2	4	1	2	6	2	2	7	0
Text 2	1	5	4	3	8	2	2	5	1
Text 3	3	1	4	0	1	8	7	2	3

Each number in Table 1 is the number of times the feature word occurs in the corresponding text.
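The construction of such a matrix can be sketched as follows. The embodiment performs this step in R with a word segmentation tool; the sketch below instead uses Python with plain whitespace splitting as a stand-in, and the function name `build_document_term_matrix`, the feature words and the two sample texts are hypothetical and not taken from Table 1.

```python
from collections import Counter


def build_document_term_matrix(texts, features):
    """Return one row of feature-word counts per text, in the order of `features`."""
    matrix = []
    for text in texts:
        # Whitespace splitting is an assumption standing in for a real
        # word segmentation tool or user-defined segmentation rules.
        counts = Counter(text.lower().split())
        matrix.append([counts[word] for word in features])
    return matrix


# Hypothetical feature words and texts, chosen only for illustration.
features = ["declaration", "export", "port"]
texts = [
    "export declaration filed at port export cleared",
    "port port declaration",
]

for row in build_document_term_matrix(texts, features):
    print(row)
```

Each printed row plays the role of one "Text" row in the table: words that are not feature words are ignored, and a feature word absent from a text is counted as 0.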

S220: Calculate the similarity between data packages

The similarity between two data packages can be calculated using Formula 1 below:

【Formula 1】

G = \frac{(N_{1} \times M_{1}) + (N_{2} \times M_{2}) + ... + (N_{m} \times M_{m})}{\sqrt{N_{1}^{2} + N_{2}^{2} + ... + N_{m}^{2}} \times \sqrt{M_{1}^{2} + M_{2}^{2} + ... + M_{m}^{2}}}

where G is the similarity between the two data packages, with a range of [0, 1]; N_1, N_2, ..., N_m and M_1, M_2, ..., M_m are the numbers of times each feature word occurs in the two data packages being compared, respectively. In this embodiment, the predetermined threshold may be 0.5: G greater than 0.5 indicates that the two data packages are similar, and G greater than 0.85 indicates that they are highly similar.

Taking Table 1 as an example, the words occurring in text 1 are C1, C2, C3, C4, ..., Cm, and they occur N1, N2, N3, ..., Nm times respectively; the words occurring in text 2 are C1, C2, C3, C4, ..., Cm, and they occur M1, M2, M3, ..., Mm times respectively. Here C1 denotes the same word in both texts, and N1 and M1 are its counts in text 1 and text 2. The similarity between text 1 and text 2 can then be calculated using Formula 1 above.

Since the similarity score between text 1 and text 2 is 0.97, which is greater than 0.85, it can be determined that the data package containing text 1 and the data package containing text 2 are highly similar. If the scarcity of text 1 is to be assessed, the data package containing text 2 can be used as a comparison data package. Likewise, the similarity between text 1 and text 3 can be calculated; the resulting score is 0.4, which is less than 0.5, indicating that the similarity between the data package containing text 1 and the data package containing text 3 is not high, so the data package containing text 3 cannot be used as a comparison data package. When the scarcity of text 2 or text 3 is to be assessed instead, the similarities are calculated in the same way as for text 1.
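The selection of comparison data packages can be sketched in Python as follows. The sketch applies the cosine form of Formula 1 to the rows of Table 1 and the thresholds 0.5 and 0.85 of this embodiment; the function names `cosine_similarity` and `classify` are hypothetical. The scores computed from the table as reproduced here need not match the quoted scores exactly, so the sketch reports only the resulting classification.

```python
import math


def cosine_similarity(n, m):
    """Formula 1: dot product of two count vectors divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(n, m))
    norm_n = math.sqrt(sum(a * a for a in n))
    norm_m = math.sqrt(sum(b * b for b in m))
    if norm_n == 0 or norm_m == 0:
        return 0.0  # assumption: a text with no feature words is similar to nothing
    return dot / (norm_n * norm_m)


def classify(g, threshold=0.5, high=0.85):
    """Map a similarity score to the categories used in this embodiment."""
    if g > high:
        return "highly similar"
    if g > threshold:
        return "similar"
    return "not similar"


# Rows of Table 1.
text1 = [2, 4, 1, 2, 6, 2, 2, 7, 0]
text2 = [1, 5, 4, 3, 8, 2, 2, 5, 1]
text3 = [3, 1, 4, 0, 1, 8, 7, 2, 3]

candidates = {"text 2": text2, "text 3": text3}
for name, row in candidates.items():
    print(name, classify(cosine_similarity(text1, row)))

# Only candidates above the predetermined threshold become comparison packages.
comparison = [name for name, row in candidates.items()
              if cosine_similarity(text1, row) > 0.5]
print(comparison)
```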

S300: Calculate the scarcity of the data package to be assessed

When calculating scarcity, a data package to be assessed must be selected; it can be determined according to the actual situation. The more data of the same kind there is, the lower the scarcity; the less data of the same kind there is, the higher the scarcity.

For a specified data package to be assessed, its scarcity can be assessed by Formula 2 below:

【Formula 2】

f = \frac{2 e^{- y / x}}{1 + e^{- y / x}}

where f is the scarcity score of the data package to be assessed, with a value range of [0, 1]; y is the total number of data records in the data packages other than the data package to be assessed; and x is the number of data records in the data package to be assessed. The number of data records can be determined according to a preset rule; for example, a record may be a single sentence or a passage of text about a particular event. f = 0 indicates that the data in the data package to be assessed is not very scarce; f = 1 indicates that the data in the data package to be assessed does not exist in the other comparison data packages and is very scarce.
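The behaviour of the scarcity formula at its two ends can be checked with a short Python sketch. The function name `scarcity_score` and the input checks are assumptions added for illustration. The sketch shows that f equals 1 exactly when y = 0 and decreases toward 0 as y grows relative to x, so that f = 0 is approached as a limit rather than reached for finite counts.

```python
import math


def scarcity_score(x, y):
    """Scarcity formula: f = 2*exp(-y/x) / (1 + exp(-y/x)).

    x: number of records in the data package to be assessed (must be positive)
    y: total number of records in the other data packages (must be non-negative)
    """
    if x <= 0:
        raise ValueError("x must be positive")
    if y < 0:
        raise ValueError("y must be non-negative")
    ratio = math.exp(-y / x)
    return 2 * ratio / (1 + ratio)


# The score falls as the amount of similar data elsewhere grows.
for y in (0, 1, 10, 100):
    print(y, scarcity_score(1, y))
```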

The assessment of scarcity is illustrated below by way of an example.

Example

First, according to the specified content "informatization-related", a Python program is used to crawl related data packages 1 and 2 from two data provision platforms 1 and 2, and data package 1 is chosen as the data package whose scarcity is to be assessed.

Then, following step S200, a document-term matrix is built for the data packages of the two data provision platforms, as shown in Table 2 below:

Table 2

Feature	Data	Field	Information	Microblog	Machine	Society	Time	Public opinion	Learning	Collection
Data package 1	1	2	3	2	1	1	1	1	1	1
Data package 2	1	1	1	2	0	0	1	3	2	5

Then, the similarity between the two data packages is calculated using Formula 1 above. The resulting similarity score is 0.63, which shows that the two data packages are similar.

Statistics show that data packages 1 and 2 contain 6,000,000 data records in total, of which data package 1 contains 5,000,000 and data package 2 contains 1,000,000. Using Formula 2 above with x = 5,000,000 and y = 1,000,000, the scarcity of data package 1 is:

f = \frac{2 e^{- 1000000 / 5000000}}{1 + e^{- 1000000 / 5000000}} \approx 0.90

This indicates that data package 1 is very scarce.
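The figures of this example can be checked numerically. The Python sketch below applies the similarity formula to the two rows of Table 2, taking the data package 2 row as 1, 1, 1, 2, 0, 0, 1, 3, 2, 5 (ten values, one per feature word), and applies the scarcity formula with x = 5,000,000 and y = 1,000,000; the variable names are illustrative.

```python
import math

# Feature-word counts from Table 2.
package1 = [1, 2, 3, 2, 1, 1, 1, 1, 1, 1]
package2 = [1, 1, 1, 2, 0, 0, 1, 3, 2, 5]

# Similarity formula: cosine of the angle between the two count vectors.
dot = sum(a * b for a, b in zip(package1, package2))
norm1 = math.sqrt(sum(a * a for a in package1))
norm2 = math.sqrt(sum(b * b for b in package2))
g = dot / (norm1 * norm2)

# Scarcity formula with the record counts stated in the example.
x = 5_000_000  # records in data package 1 (the package being assessed)
y = 1_000_000  # records in data package 2 (the comparison package)
f = 2 * math.exp(-y / x) / (1 + math.exp(-y / x))

print(f"G = {g:.2f}")  # G = 0.63
print(f"f = {f:.2f}")  # f = 0.90
```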

【Embodiment 2】Data package scarcity assessment system

Fig. 2 is a schematic structural diagram of the data package scarcity assessment system provided by an embodiment of the present invention. As shown in Fig. 2, the data package scarcity assessment system provided by this embodiment comprises a data acquisition module, a similarity assessment module and a scarcity assessment module.

The data acquisition module is used to obtain multiple related data packages related to specified content. The multiple related data packages can be obtained by crawling network data from multiple data provision platforms over the network. For example, based on the specified content, a Python program can be used to crawl the related data packages on the major data trading websites, and the crawled data is stored in a MySQL relational database. A data package may contain files of various data types, such as JSON, picture, video and audio files. The crawling process is as follows: after a user enters a URL, the server host is located through a DNS server and a request is sent to the server; after parsing the request, the server sends HTML, JS, CSS and other files to the user's browser, which parses and renders them. The web page the user sees is therefore essentially composed of HTML code, and this is the content the crawler retrieves. By analyzing and filtering the HTML code, resources such as pictures, text and uploaded attachments can be crawled, so that content related to data packages, such as data package descriptions, can be crawled from the major data trading websites. In this way, multiple related data packages containing the same subject content can be obtained. Of course, data packages that have already been obtained may also be selected for assessment before the assessment operation, rather than being crawled in real time during the assessment operation.

The similarity assessment module is used to determine a data package to be assessed, determine the similarity between the data package to be assessed and the other data packages, and select the data packages whose similarity to the data package to be assessed is higher than a predetermined threshold as comparison data packages. The similarity assessment module may comprise: a document-term matrix building unit, which reads the text of the data package to be assessed and of the comparison data packages into an R language program, splits the text in each data package into individual words using a word segmentation tool or user-defined segmentation rules, determines feature words and counts the frequency with which each feature word occurs, and builds a document-term matrix; and a similarity calculation unit, which calculates the similarity between the data package to be assessed and each comparison data package based on the following formula:

G = \frac{(N_{1} \times M_{1}) + (N_{2} \times M_{2}) + ... + (N_{m} \times M_{m})}{\sqrt{N_{1}^{2} + N_{2}^{2} + ... + N_{m}^{2}} \times \sqrt{M_{1}^{2} + M_{2}^{2} + ... + M_{m}^{2}}}

where G is the similarity between the data package to be assessed and another data package, with a range of [0, 1]; N_1, N_2, ..., N_m and M_1, M_2, ..., M_m are the numbers of times each feature word occurs in the data package to be assessed and in the other data package, respectively. The predetermined threshold may be 0.5: G greater than 0.5 indicates that the data package to be assessed is similar to the comparison data package, and G greater than 0.85 indicates that the two are highly similar.

The scarcity assessment module is used to determine the scarcity of the data package to be assessed using a preset processing method, specifically by the following formula:

f = \frac{2 e^{- y / x}}{1 + e^{- y / x}}

where f is the scarcity score of the data package to be assessed, with a value range of [0, 1]; y is the total number of data records in the data packages other than the data package to be assessed; and x is the number of data records in the data package to be assessed. f = 0 indicates that the data in the data package to be assessed is not very scarce; f = 1 indicates that the data in the data package to be assessed does not exist in the other comparison data packages and is very scarce.
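How the similarity assessment module and the scarcity assessment module might fit together can be sketched in Python. This is an illustrative sketch only: the names `DataPackage`, `similarity`, `select_comparison_packages` and `assess_scarcity` are hypothetical, the data acquisition module is omitted, and it is an assumption of the sketch that y is summed over the selected comparison data packages only.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class DataPackage:
    name: str
    term_counts: List[int]  # one row of the document-term matrix
    record_count: int       # number of data records in the package


def similarity(a: DataPackage, b: DataPackage) -> float:
    """Similarity calculation unit: cosine similarity of the feature-word counts."""
    dot = sum(p * q for p, q in zip(a.term_counts, b.term_counts))
    norm_a = math.sqrt(sum(p * p for p in a.term_counts))
    norm_b = math.sqrt(sum(q * q for q in b.term_counts))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_comparison_packages(target, others, threshold=0.5):
    """Similarity assessment module: keep packages above the predetermined threshold."""
    return [p for p in others if similarity(target, p) > threshold]


def assess_scarcity(target, others, threshold=0.5):
    """Scarcity assessment module: f = 2*exp(-y/x) / (1 + exp(-y/x))."""
    comparison = select_comparison_packages(target, others, threshold)
    # Assumption: only the selected comparison packages contribute to y.
    y = sum(p.record_count for p in comparison)
    ratio = math.exp(-y / target.record_count)
    return 2 * ratio / (1 + ratio)


# Hypothetical packages: B shares A's feature words, C shares none of them.
target = DataPackage("A", [1, 0, 1], 100)
others = [DataPackage("B", [1, 0, 1], 50), DataPackage("C", [0, 1, 0], 1000)]

print([p.name for p in select_comparison_packages(target, others)])
print(assess_scarcity(target, others))
```

Under this assumption a large but unrelated package such as C does not lower the score of A, which is consistent with selecting comparison data packages before the scarcity is calculated.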

It should be noted that assessing the value of a data file involves many factors, and the final valuation of a data file can only be obtained by considering all of them. The present invention addresses only one aspect, the assessment of data scarcity, and provides one reference basis for the valuation of data files.

In summary, the present invention introduces scarcity analysis, a method from the field of economics, into the valuation of data assets, so as to better serve data market activity and to promote data market transactions and the rapid implementation of data projects.

Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.

Although preferred embodiments of the application have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the application. Obviously, those skilled in the art can make various changes and variations to the embodiments of the application without departing from their spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the application and their technical equivalents, the application is intended to cover them as well.

Claims

1. A data package scarcity assessment method, characterized by comprising:

S100: obtaining multiple related data packages related to specified content;

S200: determining a data package to be assessed, determining the similarity between the data package to be assessed and the other data packages, and selecting the data packages whose similarity to the data package to be assessed is higher than a predetermined threshold as comparison data packages;

S300: determining the scarcity of the data package to be assessed using a preset processing method, specifically by the following formula:

f = \frac{2 e^{- y / x}}{1 + e^{- y / x}}

where f is the scarcity score of the data package to be assessed, with a value range of [0, 1]; y is the total number of data records in the data packages other than the data package to be assessed; and x is the number of data records in the data package to be assessed.

2. The method according to claim 1, characterized in that in step S200 a text similarity algorithm is used to calculate the similarity between the data package to be assessed and the other data packages, specifically comprising:

S210: reading the text of the data package to be assessed and of the comparison data packages into an R language program, splitting the text in each data package into individual words using a word segmentation tool or user-defined segmentation rules, determining feature words and counting the frequency with which each feature word occurs, and building a document-term matrix;

S220: calculating the similarity between the data package to be assessed and each comparison data package based on the following formula:

G = \frac{(N_{1} \times M_{1}) + (N_{2} \times M_{2}) + ... + (N_{m} \times M_{m})}{\sqrt{N_{1}^{2} + N_{2}^{2} + ... + N_{m}^{2}} \times \sqrt{M_{1}^{2} + M_{2}^{2} + ... + M_{m}^{2}}}

where G is the similarity between the data package to be assessed and another data package, with a range of [0, 1]; N_1, N_2, ..., N_m and M_1, M_2, ..., M_m are the numbers of times each feature word occurs in the data package to be assessed and in the other data package, respectively.

3. The method according to claim 2, characterized in that G greater than 0.5 indicates that the data package to be assessed is similar to the comparison data package, and G greater than 0.85 indicates that the two are highly similar.

4. The method according to claim 1, characterized in that f = 0 indicates that the data in the data package to be assessed is not scarce, and f = 1 indicates that the data in the data package to be assessed does not exist in the other comparison data packages and is very scarce.

5. The method according to claim 1, characterized in that the multiple related data packages related to the specified content are obtained by crawling network data from multiple data platforms on the Internet.

6. A data package scarcity assessment system, characterized by comprising:

a data acquisition module, which obtains multiple related data packages related to specified content;

a similarity assessment module, which determines a data package to be assessed, determines the similarity between the data package to be assessed and the other data packages, and selects the data packages whose similarity to the data package to be assessed is higher than a predetermined threshold as comparison data packages;

a scarcity assessment module, which determines the scarcity of the data package to be assessed using a preset processing method, specifically by the following formula:

f = \frac{2 e^{- y / x}}{1 + e^{- y / x}}

7. The system according to claim 6, characterized in that the similarity assessment module comprises:

a document-term matrix building unit, which reads the text of the data package to be assessed and of the comparison data packages into an R language program, splits the text in each data package into individual words using a word segmentation tool or user-defined segmentation rules, determines feature words and counts the frequency with which each feature word occurs, and builds a document-term matrix;

a similarity calculation unit, which calculates the similarity between the data package to be assessed and each comparison data package based on the following formula:

G = \frac{(N_{1} \times M_{1}) + (N_{2} \times M_{2}) + ... + (N_{m} \times M_{m})}{\sqrt{N_{1}^{2} + N_{2}^{2} + ... + N_{m}^{2}} \times \sqrt{M_{1}^{2} + M_{2}^{2} + ... + M_{m}^{2}}}

8. The system according to claim 7, characterized in that G greater than 0.5 indicates that the data package to be assessed is similar to the comparison data package, and G greater than 0.85 indicates that the two are highly similar.

9. The system according to claim 6, characterized in that f = 0 indicates that the data in the data package to be assessed is not scarce, and f = 1 indicates that the data in the data package to be assessed does not exist in the other comparison data packages and is very scarce.

10. The system according to claim 6, characterized in that the data acquisition module obtains the multiple related data packages related to the specified content by crawling network data from multiple data platforms on the Internet.