CN112463804A

CN112463804A - KDTree-based image database data processing method

Info

Publication number: CN112463804A
Application number: CN202110139298.4A
Authority: CN
Inventors: 王浩; 秦拯; 陈嘉欣; 欧露
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2021-02-02
Filing date: 2021-02-02
Publication date: 2021-03-09
Anticipated expiration: 2041-02-02
Also published as: CN112463804B

Abstract

The invention discloses a KDTree-based image database data processing method, which comprises the following steps: step one, traversing and integrating map labeling information based on a KDTree to obtain a label set S = {s1, s2, …, sn}; step two, performing sensitive information detection based on word similarity on the label set S, and grading the sensitivity of the map label content; and step three, performing corresponding desensitization processing according to the sensitivity level of the map label content. The invention uses the position information of map labels to automate the traversal and integration of label content in geospatial data and the desensitization of sensitive information, and overcomes the tedium, low efficiency, and proneness to errors and omissions of existing manual processing.

Description

KDTree-based image database data processing method

Technical Field

The invention relates to the field of geographic information data processing, in particular to a KDTree-based image database data processing method.

Background

In recent years, the rapid development of computer technology has driven progress in geographic information system technology. A Geographic Information System (GIS) draws on multiple disciplines such as geography, surveying and mapping, and computer science and technology. It uses the computer as a tool and geospatial data as its research object, integrates the visualization and geographic analysis functions of maps with database operations, and provides decision-support information for industries and departments such as geography, planning, and management.

Currently, with the development of the mobile internet, people's demand for geographic-information-based services in daily travel keeps increasing. The wide application of electronic maps brings convenience to work and life, but also introduces potential security risks. Among these, the security protection of map labeling information is a problem worth studying. Map labeling information may include sensitive content such as national strategic resources, military banners, and defense facilities. In response, the state has issued a number of legal provisions and policies, such as the "Supplementary Provisions on the Content Representation of Public Maps" (trial implementation) and the "Provisions on the Public Representation of Basic Geographic Information" (trial implementation), which explicitly specify what may and may not be represented in public maps, thereby strengthening geographic information security protection at the legal and policy level. At present, organizations use intranet isolation to ensure data security, and require sensitive data on the intranet to be desensitized before geographic information is released to the extranet.

In practice, the label content on a map is stored in the form of an attribute table. To make the rendering attractive and complete, some labels are composed of multiple single characters; for example, a place name may be represented by several single-character records in the attribute table, so part of the label content is easily missed when a keyword is searched on the map. In this case, existing desensitization work has to rely on manually inspecting the label content of each region of the map and then reviewing, identifying, and processing it. This still suffers from omissions, and because maps at the municipal level and above cover large areas with complex content, the manual work is tedious and inefficient. A computer-based method for integrating the labeling information in a map and desensitizing it is therefore urgently needed, and is of great significance for maintaining the security of geographic information.

For label content composed of multiple single characters, the common segmentation method is lexical analysis: a character sequence is converted into a word sequence by segmenting the received string of continuous characters into individual words, and the resulting words are then matched against a sensitive word lexicon to detect sensitive information. However, after routine add, delete, or modify operations on the single characters of a label, those characters may appear out of order or repeated in the attribute table, and simple word segmentation alone cannot handle this situation.

Although the label content is arranged irregularly in the attribute table, each character on the map is associated with a position coordinate, and the label fields to be integrated are strongly correlated in their position distribution: for example, they are closely spaced, run from top to bottom or from left to right, and lie almost on a straight line. The invention uses the position information of map labels to realize the traversal and integration of label content in geospatial data and the desensitization of sensitive information.

Explanation of terms:

KDTree (k-dimensional tree): a tree data structure that stores instance points in a k-dimensional space so that they can be retrieved quickly.

jieba word segmentation: a very popular open-source Chinese word segmentation package, characterized by high performance, accuracy, and extensibility; it mainly supports Python at present, and versions exist for other languages.

word2vec: a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes a word as input and predicts the words in adjacent positions; under the bag-of-words assumption in word2vec, the order of the words is unimportant. After training, the word2vec model can be used to map each word to a vector that represents word-to-word relationships; this vector is the hidden layer of the neural network.

Disclosure of Invention

The invention provides a KDTree (k-dimensional tree)-based image database data processing method. The invention uses the position information of map labels to automate the traversal and integration of label content in geospatial data and the desensitization of sensitive information, and overcomes the tedium, low efficiency, and proneness to errors and omissions of existing manual processing.

In order to achieve the purpose, the technical scheme of the invention is as follows:

a KDTree-based image database data processing method comprises the following steps:

step one, traversing and integrating map labeling information based on a KDTree to obtain a label set S = {s₁, s₂, …, s_n} formed by n labels, where s_n represents the nth label; this specifically comprises the following steps:

1.1 extracting the position coordinates and character content of every single character on the map from the map's attribute table to form a data set, and constructing a KDTree from the two-dimensional coordinates of each character;

1.2 sorting all single characters in descending order of ordinate to obtain a preliminary queue Q; creating a mark array vis[] that records whether each single character in the queue Q has been processed, initializing it to 0, and traversing the queue Q until it is empty;

1.3 processing the single characters in the order in which they appear in the preliminary queue Q;

if the current single-character point p has not been processed, i.e. vis[p] = 0, executing step 1.4 and setting vis[p] to 1;

if the current single-character point p has been processed, i.e. vis[p] = 1, skipping to the next point in the queue Q;

1.4 searching the constructed KDTree for the points whose distance from the current single-character point p lies within the threshold range [0, ε] to obtain the neighbor node set of p, where ε is the integration-range parameter and is taken as 1.5 to 2 times the character width of the current single-character point p; searching the neighbor node set, in order of increasing distance from p, for a single-character point q that satisfies the integration condition with p; if one is found, replacing the current single-character point with q and setting vis[q] to 1; the single-character point q is the point in the KDTree corresponding to the single character q;

1.5 repeating step 1.4 until the neighbor node set contains no single-character point that can be integrated with the current single-character point, and taking the integrated single characters as one label;

1.6 when the neighbor node set contains no single-character point that can be integrated with the current single-character point, processing the next unprocessed single character in the order of the preliminary queue Q;

1.7 repeating steps 1.3 to 1.6 until all single characters in the preliminary queue Q have been processed, thereby obtaining every label on the map.

step two, performing sensitive information detection based on word similarity on the label set S, and grading the sensitivity of the map label content;

step three, performing corresponding desensitization processing according to the sensitivity level of the map label content.
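Steps 1.1 to 1.7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the small pure-Python KD-tree, the function names `build_kdtree`, `radius_search`, and `integrate_labels`, and the sample coordinates are all hypothetical, and the integration condition is simplified to "nearest unprocessed neighbor within ε".

```python
import math

def build_kdtree(points, depth=0):
    """Build a 2-D KD-tree over (x, y, text, index) tuples as nested dicts."""
    if not points:
        return None
    axis = depth % 2  # alternate between x (0) and y (1)
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def radius_search(node, center, eps, found):
    """Append to `found` every stored point within distance eps of center."""
    if node is None:
        return
    point, axis = node["point"], node["axis"]
    if math.dist(point[:2], center[:2]) <= eps:
        found.append(point)
    diff = center[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    radius_search(near, center, eps, found)
    if abs(diff) <= eps:  # the far side matters only if the split line is within eps
        radius_search(far, center, eps, found)

def integrate_labels(chars, eps):
    """chars: list of (x, y, text). Returns the integrated label strings."""
    points = [(x, y, text, i) for i, (x, y, text) in enumerate(chars)]
    tree = build_kdtree(points)                    # step 1.1
    queue = sorted(points, key=lambda p: -p[1])    # step 1.2: ordinate descending
    vis = [0] * len(points)
    labels = []
    for p in queue:                                # step 1.3
        if vis[p[3]]:
            continue
        vis[p[3]] = 1
        field, current = [p], p
        while True:                                # steps 1.4 and 1.5
            neighbours = []
            radius_search(tree, current, eps, neighbours)
            neighbours.sort(key=lambda q: math.dist(q[:2], current[:2]))
            # Simplified integration condition (assumption): nearest unprocessed point.
            q = next((n for n in neighbours if not vis[n[3]]), None)
            if q is None:
                break                              # step 1.6: nothing left to integrate
            vis[q[3]] = 1
            field.append(q)
            current = q
        labels.append("".join(c[2] for c in field))
    return labels

# Hypothetical sample: two horizontal labels with unit character spacing.
chars = [(0, 10, "湖"), (1, 10, "南"), (2, 10, "大"), (3, 10, "学"), (0, 0, "公"), (1, 0, "园")]
print(integrate_labels(chars, eps=1.5))
```

In practice a library KD-tree with a radius query could replace the hand-written tree; the hand-written version is used here only to keep the sketch dependency-free.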

In a further improvement, in step 1.4, the integration conditions are as follows:

the single-character point q has not been processed, i.e. vis[q] = 0;

in the first case, when the integrated field contains only the current single-character point p, the point p is integrated with the single-character point q in the neighbor node set that is closest to p; when the neighbor node set contains only the single-character point p itself, p alone forms a label;

in the second case, when the integrated field already contains two or more characters, i.e. when a field formed by several characters is to be integrated with the single-character point q, it is judged whether all characters in the new field s formed by q and the integrated field lie on the same straight line, and whether the range R of the array formed by the distances between every two adjacent characters satisfies:

$$R = \max_{1 \le i < Len,\ j = i+1} d(s_i, s_j) - \min_{1 \le i < Len,\ j = i+1} d(s_i, s_j) \le \theta \tag{1}$$

where Len represents the number of single characters contained in the new field s, d(s_i, s_j) represents the distance between the i-th single-character point s_i and the j-th single-character point s_j of the new field s, max and min respectively denote taking the maximum and the minimum, and θ is 0.2 to 0.5 times the character width in the new field s;

if both integration conditions are satisfied, the field formed by the several characters is integrated with the single-character point q; if at least one of them is not satisfied, the field is not integrated with the single-character point q.
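The second case can be illustrated with a short Python check. This is a sketch under assumptions: the range condition is read as R ≤ θ, collinearity is tested with a hypothetical tolerance `line_tol` (the text only says "the same straight line"), and the function name `can_integrate` is illustrative.

```python
import math

def can_integrate(field, q, theta, line_tol=1e-6):
    """field: (x, y) points already integrated (two or more); q: candidate (x, y)."""
    pts = list(field) + [q]
    (x0, y0), (x1, y1) = pts[0], pts[-1]
    span = math.dist(pts[0], pts[-1])
    if span == 0:
        return False  # degenerate: first and last point coincide
    # Collinearity: the perpendicular distance of every point to the line
    # through the first and last point must stay within line_tol.
    for x, y in pts:
        if abs((x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)) / span > line_tol:
            return False
    # Range R of the distances between adjacent characters, compared with theta.
    gaps = [math.dist(a, b) for a, b in zip(pts, pts[1:])]
    return max(gaps) - min(gaps) <= theta
```

For example, with characters at (0, 0) and (1, 0), a candidate at (2, 0) is evenly spaced and collinear and so is accepted, whereas a candidate at (2.8, 0) is rejected for θ = 0.3 because the adjacent distances differ by about 0.8.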

In a further improvement, interference from duplicate characters is eliminated before integration: if the character frames corresponding to p and q intersect and their character contents are the same, p and q are duplicate characters, and q is deleted from the attribute table to achieve deduplication.
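A minimal sketch of this deduplication rule is given below. It assumes that each character frame is an axis-aligned box `(xmin, ymin, xmax, ymax)` and that frames which merely touch count as intersecting; both are assumptions, as are the names `boxes_intersect` and `remove_duplicates`.

```python
def boxes_intersect(a, b):
    """a, b: (xmin, ymin, xmax, ymax) axis-aligned character frames."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def remove_duplicates(chars):
    """chars: list of dicts with 'text' and 'box'; keeps the first of each duplicate pair."""
    kept = []
    for c in chars:
        is_duplicate = any(
            k["text"] == c["text"] and boxes_intersect(k["box"], c["box"]) for k in kept
        )
        if not is_duplicate:
            kept.append(c)
    return kept
```

Two records with the same character and overlapping frames collapse to one, while the same character appearing at a distant position is kept.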

In a further improvement, in step 1.7, for horizontally distributed labels, the single characters in a label are arranged from left to right in ascending order of abscissa.

In a further improvement, the second step includes the following steps:

2.1: performing word segmentation on the marked content by adopting a Chinese word segmentation technology:

for each piece of label content s_i in the label set S, a Chinese word segmentation technology and a word vector construction technology are used to convert it into several word vectors; segmentation yields s_i = {a_1, a_2, …, a_m}, where a_1 … a_m are the m feature words obtained after segmentation;

2.2: converting the feature words and the sensitive words into word vectors by a word vector construction technology:

word2vec is used to convert the feature words into word vectors, and the word vector obtained from the j-th feature word a_j is denoted A_j; similarly, every sensitive word b_k in the sensitive word lexicon is converted into a word vector, denoted B_k; the degree of similarity between a feature word and a sensitive word is quantified as the similarity between the feature word vector and the sensitive word vector, i.e. the cosine of the angle between the two vectors in the inner product space is taken as the similarity, its value range is [0, 1], and the closer the similarity is to 1, the more similar the two words are:

$$\mathrm{sim}(A_j, B_k) = \frac{A_j \cdot B_k}{\lVert A_j \rVert \, \lVert B_k \rVert} \tag{2}$$

where sim(A_j, B_k) represents the similarity between A_j and B_k;

2.3 sensitivity calculation of feature words:

in the sensitive word lexicon, each sensitive word c_k corresponds to a sensitivity level L_k; the larger the value, the more sensitive the word; the sensitive word lexicon is traversed, and for a feature word a_j its maximum sensitivity Sen_max(a_j) is defined as

$$\mathrm{Sen}_{\max}(a_j) = \max_{c_k \in C} \big( \mathrm{sim}(a_j, c_k) \cdot L_k \big) \tag{3}$$

where C represents the sensitive word lexicon and sim(a_j, c_k) is the cosine similarity between the word vectors of a_j and c_k; that is, the product of the similarity and the sensitivity level is computed for each sensitive word vector and the feature word vector, and the maximum is taken as the maximum sensitivity of the feature word;

a threshold parameter θ is set; when Sen_max(a_j) is greater than θ, the feature word a_j is sensitive; otherwise it is not regarded as a sensitive word and its sensitivity is recorded as 0, i.e. the sensitivity of the feature word is

$$\mathrm{Sen}(a_j) = \begin{cases} \mathrm{Sen}_{\max}(a_j), & \mathrm{Sen}_{\max}(a_j) > \theta \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

2.4 sensitivity calculation for annotated content:

the sensitivity Sen(s_i) of the i-th label s_i is defined as:

$$\mathrm{Sen}(s_i) = \sum_{j=1}^{m} \frac{j}{\sum_{t=1}^{m} t} \, \mathrm{Sen}(a_j) \tag{5}$$

where the label s_i contains m feature words, j indicates that a_j is the j-th feature word of the label s_i, and Sen(a_j) is its sensitivity; the sensitivity of the label s_i is the accumulated sum of the sensitivities of the m feature words it contains, each weighted by its position j relative to the sum 1 + 2 + … + m; after the sensitivities of the feature words and of the label content are obtained, the label content is divided according to sensitivity into 4 levels: high sensitivity, medium sensitivity, low sensitivity, and non-sensitive.
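The label-level score and the four-level grading of step 2.4 can be sketched as follows. Two things here are assumptions rather than statements of the source: the position weight of the j-th feature word is taken as j / (1 + 2 + … + m), which is one reading of the position-aware definition, and the grading cut-offs `low_max` and `medium_max` are hypothetical because the text gives no numeric boundaries.

```python
def label_sensitivity(word_sens):
    """word_sens: sensitivities of the m feature words, in label order."""
    m = len(word_sens)
    if m == 0:
        return 0.0
    total = m * (m + 1) / 2  # 1 + 2 + ... + m
    # Later words get larger weights (assumed weight j / total).
    return sum(j * s for j, s in enumerate(word_sens, start=1)) / total

def grade(sen, low_max=0.5, medium_max=1.5):
    """Map a label sensitivity to one of the 4 levels; cut-offs are hypothetical."""
    if sen <= 0:
        return "non-sensitive"
    if sen < low_max:
        return "low"
    if sen < medium_max:
        return "medium"
    return "high"
```

With these weights, a sensitive word in the middle of a three-word label contributes one third of its sensitivity, while the same word in the last position contributes one half.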

In a further improvement, the third step includes the following steps:

3.1 constructing a whitelist of geographic labeling information: whenever non-sensitive data wrongly identified by the algorithm is found manually, it is added to the whitelist, thereby improving fault tolerance;

3.2 after the sensitive data has been screened against the whitelist, label content of high sensitivity level is deleted directly; for labels of medium and low sensitivity level, the sensitive feature words in the label are extracted and one desensitization means among deletion, replacement, and generalization is selected at random to process them, after which the sensitivity of the desensitized label is recalculated; if the label still does not meet the requirements for publication after a preset number of iterations, it is deleted entirely; when replacement is selected, the non-sensitive word with the greatest similarity to the current sensitive feature word is used as the substitute; when generalization is selected, the specific content described by the label is abstracted so that its description covers more non-sensitive information.
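Steps 3.1 and 3.2 can be sketched as a small loop. Everything named here is hypothetical: the function `desensitize`, the callables `word_sen` and `grade_of`, and the lookup tables `substitutes` and `general_terms` stand in for components the text describes only in prose. The sketch also simplifies "extracting the sensitive feature words" to treating the single most sensitive word per iteration.

```python
import random

def desensitize(words, word_sen, grade_of, substitutes, general_terms,
                whitelist=frozenset(), max_iter=5, seed=None):
    """Return the desensitized word list, or None when the label is deleted."""
    words = list(words)
    if "".join(words) in whitelist:
        return words                      # manually confirmed as non-sensitive
    level = grade_of(words)
    if level == "high":
        return None                       # high sensitivity: delete directly
    rng = random.Random(seed)
    for _ in range(max_iter):
        if level == "non-sensitive":
            return words
        # Treat the most sensitive feature word (simplifying assumption).
        i = max(range(len(words)), key=lambda k: word_sen(words[k]))
        action = rng.choice(["delete", "replace", "generalize"])
        if action == "delete":
            new = ""
        elif action == "replace":
            new = substitutes.get(words[i], "")
        else:
            new = general_terms.get(words[i], "")
        # An empty replacement falls back to deleting the word.
        words = words[:i] + ([new] if new else []) + words[i + 1:]
        level = grade_of(words)           # recalculate after desensitization
    return words if level == "non-sensitive" else None
```

The return value `None` models complete deletion of the label, either because it was graded high or because it was still sensitive after `max_iter` iterations.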

The invention has the advantages that:

the invention uses the position information of map labels to automate the traversal and integration of label content in geospatial data and the desensitization of sensitive information, and overcomes the tedium, low efficiency, and proneness to errors and omissions of existing manual processing.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

Fig. 1 shows a KDTree-based image database data processing method, which includes the following steps:

(1) traversing and integrating map labeling information based on KDTree;

(2) sensitive information detection is carried out based on word similarity;

(3) desensitization is performed based on the sensitivity of the labels.

The specific contents are as follows:

(1) a map labeling information traversal and integration method based on a KDTree comprises the following steps:

The method exploits the correlation of map labeling information in position distribution. It treats every single character as a point and greedily assumes that the nearest other point within a certain range of the current single-character point can be integrated with it into a word; only when that point does not satisfy the integration condition are the other points in the range considered. The specific flow is as follows:

Step one: extract the position coordinates and character content of all single characters on the map from the attribute table, and construct a KDTree from the two-dimensional coordinates of each character. A KDTree is a data structure that partitions a k-dimensional data space and is commonly used for range search and nearest-neighbor search.

Step two: because people usually read from top to bottom and from left to right, all single characters are sorted in descending order of ordinate before the label content is integrated, giving a preliminary queue Q, so that subsequently visited single characters are appended to the currently integrated field in order. A mark array vis is created to record whether each character has been processed, and is initialized to 0. The queue Q is traversed until it is empty.

Step three: if the current point p has been processed, i.e. vis[p] = 1, skip to the next point in the queue; if vis[p] = 0, search the KDTree for the points whose distance from the current single-character point p lies within the threshold range [0, ε] to obtain the neighbor node set of p, then search that set, in order of increasing distance from p, for a single-character point q that satisfies the integration condition with p; if one is found, replace the current single-character point with q and set vis[q] to 1.

Step four: repeat step three until the neighbor node set contains no single-character point that can be integrated with the current single-character point, take the integrated single characters as one label, and process the next unprocessed single character in the order of the preliminary queue Q.

Step five: repeat steps three and four until all single characters in the preliminary queue Q have been processed, thereby obtaining all labels on the map.

In addition, when an unprocessed point is integrated into an already processed field, all characters in the resulting new field s must lie on the same straight line and the distances between adjacent characters must be close to one another, i.e. the position distribution on the map must satisfy the conditions for forming a label; otherwise no integration is performed. During integration, for a field whose ordinate is almost unchanged and whose abscissa varies widely, the characters with smaller abscissa must be placed at the front of the label content when merging, i.e. the characters are arranged from left to right.

After the queue Q is empty, all single characters have been processed, and the label set obtained after integration is denoted S = {s₁, s₂, …, s_n}.
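The left-to-right ordering rule for horizontally distributed labels can be sketched as below. The test used to decide that a field is horizontal (abscissa spread at least as large as ordinate spread) and the top-to-bottom ordering of the remaining fields are assumptions made for illustration; `order_field` is a hypothetical name.

```python
def order_field(field):
    """field: non-empty list of (x, y, text). Returns the label in reading order."""
    xs = [c[0] for c in field]
    ys = [c[1] for c in field]
    if max(xs) - min(xs) >= max(ys) - min(ys):
        # Horizontal label: ascending abscissa, i.e. left to right.
        ordered = sorted(field, key=lambda c: c[0])
    else:
        # Vertical label (assumption): descending ordinate, i.e. top to bottom.
        ordered = sorted(field, key=lambda c: -c[1])
    return "".join(c[2] for c in ordered)
```

This makes the result independent of the order in which the greedy traversal happened to visit the characters of a field.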

(2) A sensitive information detection method based on word similarity comprises the following steps:

for the label set S, in order to protect data security, sensitive information detection based on word similarity is required, and the method mainly comprises the following four steps:

Step one: segment the label content into words using a Chinese word segmentation technology:

first, each piece of label content s_i in the label set S is converted into several word vectors using a Chinese word segmentation technology and a word vector construction technology. The method uses jieba word segmentation to divide a map label into several words, giving s_i = {a₁, a₂, …, a_m}, where a₁ … a_m are the m feature words obtained after segmentation.
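To make the segmentation step concrete without depending on an external package, the sketch below uses dictionary-based forward maximum matching. This is a stand-in technique, not jieba's algorithm; in practice jieba's `lcut` function would fill this role. The vocabulary, the `max_len` limit, and the function name `forward_max_match` are hypothetical.

```python
def forward_max_match(text, vocab, max_len=4):
    """Split text into words by always taking the longest vocabulary match."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

# Hypothetical vocabulary covering the example label from the description.
vocab = {"湖南省", "武警", "医院"}
feature_words = forward_max_match("湖南省武警医院", vocab)
```

The resulting list plays the role of s_i = {a₁, a₂, …, a_m} in the steps that follow.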

Step two: converting the characteristic words and the sensitive words into word vectors by adopting a word vector construction technology:

the invention uses word2vec to convert the feature words into word vectors; the word vector obtained from each feature word a_j is denoted A_j. Similarly, every sensitive word b_k in the sensitive word lexicon is converted into a word vector, denoted B_k. The degree of similarity between a feature word and a sensitive word can then be quantified as the similarity between their word vectors, i.e. the cosine of the angle between the two vectors in the inner product space is taken as the similarity:

$$\mathrm{sim}(A_j, B_k) = \frac{A_j \cdot B_k}{\lVert A_j \rVert \, \lVert B_k \rVert}$$

The value range is [0, 1]; the closer the similarity is to 1, the more similar the two words are.
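The cosine similarity between two word vectors can be computed as in the sketch below. One point is an assumption: a raw cosine lies in [-1, 1], so negative values are clamped to 0 here to match the stated range [0, 1]; the text does not say how negative cosines are handled. The vectors are plain Python lists rather than the output of a trained word2vec model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, clamped to [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity with a zero vector is undefined; treated as 0 here
    return max(0.0, dot / (norm_a * norm_b))
```

Identical directions give 1, orthogonal vectors give 0, and opposite directions are reported as 0 because of the clamp.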

Step three: sensitivity calculation of the characteristic words:

in the sensitive word lexicon, each sensitive word c_k corresponds to a sensitivity level L_k; the larger the value of L_k, the more sensitive the word. The sensitive word lexicon is traversed, and for a feature word a_j its maximum sensitivity Sen_max(a_j) is defined as

$$\mathrm{Sen}_{\max}(a_j) = \max_{c_k \in C} \big( \mathrm{sim}(a_j, c_k) \cdot L_k \big)$$

where C represents the sensitive word lexicon and sim(a_j, c_k) is the cosine similarity between the word vectors of a_j and c_k; that is, the product of the similarity and the sensitivity level is computed for each sensitive word vector and the feature word vector, and the maximum is taken as the maximum sensitivity of the feature word a_j.

A threshold parameter θ is set. When Sen_max(a_j) is greater than θ, the feature word a_j is sensitive; otherwise it is not regarded as a sensitive word and its sensitivity is recorded as 0, i.e. the sensitivity of the feature word a_j is

$$\mathrm{Sen}(a_j) = \begin{cases} \mathrm{Sen}_{\max}(a_j), & \mathrm{Sen}_{\max}(a_j) > \theta \\ 0, & \text{otherwise} \end{cases}$$
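The thresholded maximum described in step three can be written as a short function. The sketch assumes the caller has already computed, for one feature word, its similarity to every sensitive word; the function name `feature_sensitivity` and the pair-based input format are hypothetical.

```python
def feature_sensitivity(sims_and_levels, theta):
    """sims_and_levels: iterable of (similarity, sensitivity level) pairs,
    one pair per sensitive word in the lexicon."""
    # Maximum over the lexicon of similarity times level.
    sen_max = max((sim * level for sim, level in sims_and_levels), default=0.0)
    # Values at or below the threshold are treated as non-sensitive.
    return sen_max if sen_max > theta else 0.0
```

For instance, a feature word with similarity 0.5 to a level-3 sensitive word and 0.25 to a level-1 sensitive word has a maximum sensitivity of 1.5, which is kept for θ = 1.0 and reset to 0 for θ = 2.0.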

Step four: sensitivity calculation of the annotation content:

since a label may consist of several feature words, the sensitivities of all its feature words must be considered when measuring the sensitivity of the label. In addition, the feature words near the tail of the label content often have a greater influence on the sensitivity of the label. For example, in "the armed police hospital in Hunan province", although "armed police" is a sensitive word, the hospital is open to the public and is not sensitive label content; whereas in labels such as "nuclear power plant" and "military base" the trailing words determine the sensitivity of the label. Therefore, taking the position distribution of the feature words into account, the sensitivity of a label is defined as:

$$\mathrm{Sen}(s_i) = \sum_{j=1}^{m} \frac{j}{\sum_{t=1}^{m} t} \, \mathrm{Sen}(a_j)$$

where the label s_i contains m feature words, j indicates that a_j is the j-th feature word of the label, Sen(a_j) is its sensitivity, and the denominator is the cumulative sum of 1, 2, …, m. After the sensitivities of the feature words and of the label content are obtained, the label content can be divided according to sensitivity into 4 levels: high, medium, low, and non-sensitive.

(3) A desensitization method based on feature word sensitivity.

After the sensitive data in the marked content is detected according to the method (2), the sensitive data needs to be processed according to the sensitive level.

Step one: some label content that is not sensitive may contain feature words similar to sensitive words and therefore be wrongly identified as sensitive data by the algorithm. A whitelist of geographic labeling information can therefore be constructed: whenever non-sensitive data wrongly identified by the algorithm is found manually, it is added to the whitelist, so that the fault tolerance of the model is continuously improved.

Step two: to preserve the usability of the map data, it is undesirable to delete all sensitive data. After the sensitive data identified by the algorithm has been screened against the whitelist, desensitization means can be applied to reduce its sensitivity to the non-sensitive level so that it meets the requirements for publication. Label content of high sensitivity level would seriously threaten the security of geographic information if leaked, so it is deleted directly. For labels of medium and low sensitivity level, the sensitive feature words are extracted and a desensitization means such as deletion, replacement, or generalization is selected at random to process them; the sensitivity of the label is then recalculated, and if it still does not meet the requirements for publication after a certain number of iterations, the label is deleted.

For replacement, the non-sensitive word with the greatest similarity to the current sensitive feature word is selected from the non-sensitive word lexicon as the substitute. Generalization abstracts the specific content described by the label so that its description covers more non-sensitive information; for example, a "People's Liberation Army logistics base" becomes a "warehouse" after generalization and replacement.
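Selecting the substitute for a replacement amounts to an arg-max over the non-sensitive word lexicon. The sketch below assumes the lexicon is available as a dictionary from word to vector; the toy two-dimensional vectors and the name `best_substitute` are hypothetical and do not come from a trained model.

```python
import math

def cosine(a, b):
    """Plain cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_substitute(word_vec, nonsensitive_vecs):
    """Return the non-sensitive word whose vector is most similar to word_vec."""
    return max(nonsensitive_vecs, key=lambda w: cosine(word_vec, nonsensitive_vecs[w]))

# Toy vectors for illustration only.
nonsensitive = {"warehouse": [0.9, 0.1], "park": [0.0, 1.0]}
choice = best_substitute([1.0, 0.2], nonsensitive)
```

Here the sensitive word's vector points in nearly the same direction as the toy vector for "warehouse", so that word is chosen over "park".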

Claims

1. A KDTree-based image database data processing method is characterized by comprising the following steps:

step one, traversing and integrating map labeling information based on a KDTree to obtain a label set S = {s₁, s₂, …, s_n} formed by n labels, where s_n represents the nth label;

the method specifically comprises the following steps:

1.6 when no single character point which can be integrated with the current single character point exists in the neighbor node set; processing the next unprocessed single character according to the arrangement sequence of the single characters in the preliminary queue Q;

1.7 repeating the steps 1.3-1.6 until the single characters in the preliminary queue Q are all processed; obtaining each label on the map;

2. The KDTree-based image database data processing method according to claim 1, wherein in step 1.4, the integration condition is as follows:

the single character point q is not processed, namely vis [ q ] = 0;

(1)

and

3. The KDTree-based image database data processing method according to claim 2, wherein before integration, interference of duplicate words is first eliminated, if word frames corresponding to p and q intersect and the contents of the words are the same, then p and q are duplicate words, and q is deleted from the attribute table to realize deduplication.

4. The KDTree-based image database data processing method according to claim 1, wherein in step 1.7, for horizontally distributed labels, the individual characters in the label are arranged in order from small to large on the abscissa, and in order from left to right.

5. The KDTree-based image database data processing method of claim 1, wherein the second step comprises the steps of: