CN112463804A - KDTree-based image database data processing method - Google Patents

Info

Publication number
CN112463804A
Authority
CN
China
Prior art keywords
word
sensitivity
words
sensitive
single character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110139298.4A
Other languages
Chinese (zh)
Other versions
CN112463804B (en)
Inventor
王浩
秦拯
陈嘉欣
欧露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202110139298.4A
Publication of CN112463804A
Application granted
Publication of CN112463804B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Remote Sensing (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a KDTree-based image database data processing method comprising the following steps: step one, traversing and integrating map annotation information based on a KDTree to obtain an annotation set S = {s1, s2, …, sn}; step two, performing sensitive information detection based on word similarity on the annotation set S and grading the sensitivity of the map annotation content; and step three, performing the corresponding desensitization processing according to the sensitivity level of the map annotation content. The invention uses the position information of map annotations to automate the traversal and integration of annotation content in geospatial data and the desensitization of sensitive information, overcoming the tedium, low efficiency, and proneness to errors and omissions of existing manual processing.

Description

KDTree-based image database data processing method
Technical Field
The invention relates to the field of geographic information data processing, in particular to a KDTree-based image database data processing method.
Background
In recent years, the rapid development of computer technology has driven the progress of geographic information system technology. A Geographic Information System (GIS) spans multiple disciplines such as geography, surveying and mapping, and computer science and technology. It uses the computer as a tool and geospatial data as its object of study, integrates the distinctive visualization and geographic analysis functions of maps with database operations, and provides decision-making information for many industries and departments, such as geography, planning, and management.
Currently, with the development of the mobile internet, people's demand for geographic-information-based services in daily travel keeps growing. The wide use of electronic maps brings convenience to work and life, but also potential security risks. Among these, the security protection of map annotation information is a problem worth studying. Map annotation information may contain sensitive content, such as national strategic resources, military sites, and national defense facilities. In response, the state has issued a number of laws, regulations, and policies, such as the "Supplementary Provisions on Content Representation of Public Maps (Trial)" and the "Provisions on Public Representation of Basic Geographic Information Content (Trial)", which explicitly specify what may and may not be represented on public maps and thereby strengthen geographic information security protection at the legal and policy level. At present, organizations use intranet isolation to ensure data security and require that sensitive data in the intranet be desensitized before geographic information is released to the extranet.
In practice, the annotation content on a map is stored in an attribute table. To make the rendering attractive and complete, some annotations are composed of several single characters; for example, several single-character records in the attribute table together represent one place name. When a keyword is searched on the map, such annotation content is easily missed. Existing desensitization work therefore has to rely on manually inspecting the annotation content of each map region, then reviewing, identifying, and processing it. Content is still missed, and because maps at the municipal level and above cover large areas with complex content, manual processing is tedious and inefficient. A computer-based method for integrating the annotation information in a map and desensitizing it is therefore urgently needed and is important for maintaining geographic information security.
For annotation content composed of several single characters, the common segmentation approach is lexical analysis: a character sequence is converted into a word sequence by splitting the received string of consecutive characters into individual words, and the resulting words are then matched against a sensitive word bank to detect sensitive information. However, after routine add, delete, or modify operations on the single characters contained in an annotation, the single characters may be arranged out of order or repeated in the attribute table, and simple word segmentation alone cannot handle this situation.
Although the annotation content is arranged irregularly in the attribute table, every character on the map is associated with a position coordinate, and the annotation fields to be integrated are strongly correlated in their position distribution: they lie close together, run from top to bottom or from left to right, lie almost on a straight line, and so on. The invention uses the position information of map annotations to traverse and integrate the annotation content in geospatial data and to desensitize sensitive information.
Explanation of terms: KDTree (k-dimensional tree): a tree data structure that stores instance points in k-dimensional space so that they can be retrieved quickly.
jieba word segmentation: a very popular open-source Chinese word segmentation package characterized by high performance, accuracy, and extensibility. It mainly supports Python at present, and versions exist in other languages.
word2vec: a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes a word as input and predicts the words at adjacent positions; under the bag-of-words assumption in word2vec, the order of the words is unimportant. After training, the word2vec model can map each word to a vector that represents word-to-word relationships; this vector is the hidden layer of the neural network.
Disclosure of Invention
The invention provides a KDTree (k-dimensional tree) based image database data processing method. The invention uses the position information of map annotations to automate the traversal and integration of annotation content in geospatial data and the desensitization of sensitive information, overcoming the tedium, low efficiency, and proneness to errors and omissions of existing manual processing.
To achieve this purpose, the technical scheme of the invention is as follows:
A KDTree-based image database data processing method comprises the following steps:
Step one: traverse and integrate map annotation information based on a KDTree to obtain an annotation set S = {s1, s2, …, sn} formed by n annotations, where sn denotes the n-th annotation. This specifically comprises the following steps:
1.1 Extract the position coordinates and character content of all single characters on the map from the map's attribute table to form a data set, and construct a KDTree from the two-dimensional coordinates of each character.
1.2 Sort all single characters in descending order of ordinate to obtain a preliminary queue Q. Create a flag array vis[] that records whether each single character in Q has been processed, initialize it to 0, and traverse Q until it is empty.
1.3 Process the single characters in the order in which they appear in the preliminary queue Q:
if the current single-character point p has not been processed, i.e. vis[p] = 0, execute step 1.4 and set vis[p] to 1;
if the current single-character point p has been processed, i.e. vis[p] = 1, skip to the next point in Q.
1.4 In the constructed KDTree, search for the points whose distance from the current single-character point p lies within the threshold range [0, ε] to obtain the neighbor node set of p, where ε is the integration range parameter, taken as 1.5 to 2 times the character width of the current single-character point p. In order of increasing distance from p, search the neighbor node set for a single-character point q that satisfies the integration condition with p; if one is found, replace the current single-character point with q and set vis[q] to 1. The single-character point q is the point in the KDTree corresponding to the single character q.
1.5 Repeat step 1.4 until the neighbor node set contains no single-character point that can be integrated with the current single-character point, and take the integrated single-character points as one annotation.
1.6 When the neighbor node set contains no single-character point that can be integrated with the current single-character point, process the next unprocessed single character in the order of the preliminary queue Q.
1.7 Repeat steps 1.3 to 1.6 until all single characters in the preliminary queue Q have been processed, obtaining every annotation on the map.
Step two: perform sensitive information detection based on word similarity on the annotation set S, and grade the sensitivity of the map annotation content.
Step three: perform the corresponding desensitization processing according to the sensitivity level of the map annotation content.
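The greedy traversal of steps 1.2 to 1.7 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `neighbors_within` is a brute-force stand-in for the KDTree range query, `can_merge` is a hypothetical hook for the integration conditions, and breaking ordinate ties by ascending abscissa is an assumption that the source does not state.

```python
import math

def neighbors_within(points, p, eps):
    """Indices of points within distance eps of points[p], nearest first.
    Brute-force stand-in for the KDTree range query of step 1.4."""
    px, py = points[p][0], points[p][1]
    found = []
    for i, (x, y, _ch) in enumerate(points):
        if i != p:
            d = math.hypot(x - px, y - py)
            if d <= eps:
                found.append((d, i))
    return [i for _d, i in sorted(found)]

def integrate(points, eps, can_merge=lambda field, q: True):
    """points: list of (x, y, char). Returns annotations as lists of indices."""
    # Step 1.2: descending ordinate; ties by ascending abscissa (assumption).
    queue = sorted(range(len(points)), key=lambda i: (-points[i][1], points[i][0]))
    vis = [0] * len(points)
    annotations = []
    for start in queue:
        if vis[start]:                      # step 1.3: already processed
            continue
        vis[start] = 1
        field, current = [start], start
        while True:                         # steps 1.4-1.5: extend the chain
            nxt = next((q for q in neighbors_within(points, current, eps)
                        if not vis[q] and can_merge(field, q)), None)
            if nxt is None:                 # step 1.6: chain is complete
                break
            vis[nxt] = 1
            field.append(nxt)
            current = nxt
        annotations.append(field)
    return annotations

def text_of(points, field):
    """Concatenate the characters of one integrated annotation."""
    return "".join(points[i][2] for i in field)
```

With unit-width characters and ε = 1.5, a shuffled horizontal row and a separate vertical pair come back as two annotations, each in reading order.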
In a further improvement, in step 1.4, the integration conditions are as follows:
the single-character point q has not been processed, i.e. vis[q] = 0;
Case one: when the integrated field contains only the current single-character point p, p is integrated with the single-character point q in the neighbor node set that is closest to p; when the neighbor node set contains only the single-character point p itself, p alone forms an annotation.
Case two: when the integrated field contains two or more characters, i.e. when a field formed by several characters is to be integrated with the single-character point q, judge whether all characters in the new field s formed by q and the integrated field lie on the same straight line, and whether the range R of the array formed by the distances between every two adjacent characters satisfies:
$$R = \max_{1 \le i < Len,\ j = i+1} d_{i,j} - \min_{1 \le i < Len,\ j = i+1} d_{i,j} \le \theta \tag{1}$$
where Len is the number of single characters contained in the new field s, $d_{i,j}$ is the distance between the i-th and the j-th single-character points of the new field s, and max and min take the maximum and minimum value respectively; θ is 0.2 to 0.5 times the character width in the new field s.
If both integration conditions are satisfied, the field formed by the several characters is integrated with the single-character point q; if at least one condition is not satisfied, it is not.
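The two conditions of case two can be sketched as a small predicate. The function names are hypothetical, and the tolerance `tol` in the collinearity test is an assumption: the source only says the characters lie "in the same straight line" without giving a numeric tolerance.

```python
import math

def collinear(pts, tol=1e-6):
    """True if all (x, y) points lie on one line (cross-product test).
    tol is an assumed tolerance; the source gives no value."""
    if len(pts) < 3:
        return True
    (x0, y0), (x1, y1) = pts[0], pts[1]
    return all(abs((x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)) <= tol
               for x, y in pts[2:])

def adjacent_range(pts):
    """R: maximum minus minimum distance between consecutive points."""
    gaps = [math.dist(a, b) for a, b in zip(pts, pts[1:])]
    return max(gaps) - min(gaps)

def can_integrate(field_pts, q, theta, tol=1e-6):
    """field_pts: points already integrated, in order; q: candidate point."""
    new_field = list(field_pts) + [q]
    if len(new_field) < 3:
        return True  # case one: a lone character simply takes its nearest neighbor
    return collinear(new_field, tol) and adjacent_range(new_field) <= theta
```

For unit-width characters, θ would fall between 0.2 and 0.5 under the source's rule, so a candidate that widens one gap from 1.0 to 1.5 is rejected at θ = 0.3.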
In a further improvement, the interference of repeated characters is eliminated before integration: if the character boxes of p and q intersect and their character content is the same, p and q are duplicates, and q is deleted from the attribute table to remove the duplication.
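A minimal sketch of this duplicate check, assuming each record carries an axis-aligned character box `(xmin, ymin, xmax, ymax)`; the record layout and function names are illustrative and not taken from the source.

```python
def boxes_intersect(a, b):
    """Axis-aligned boxes (xmin, ymin, xmax, ymax) overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def deduplicate(records):
    """records: list of (char, box). Keeps the first record of every group of
    same-character records whose boxes intersect, and drops the later ones."""
    kept = []
    for ch, box in records:
        duplicate = any(ch == kept_ch and boxes_intersect(box, kept_box)
                        for kept_ch, kept_box in kept)
        if not duplicate:
            kept.append((ch, box))
    return kept
```

The same character appearing far away is kept, since only intersecting boxes with identical content count as repeats.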
In a further improvement, for the horizontally distributed annotations obtained in step 1.7, the single characters in each annotation are arranged from left to right in ascending order of abscissa.
In a further improvement, step two comprises the following steps:
2.1 Segment the annotation content using Chinese word segmentation:
for each annotation si of the annotation set S, Chinese word segmentation and word vector construction are used to convert si into several word vectors. Segmentation yields si = {a1, a2, …, am}, where a1, …, am are the m feature words obtained.
2.2 Convert the feature words and sensitive words into word vectors using word vector construction:
word2vec is used to convert the feature words into word vectors; the word vector of the j-th feature word aj is denoted Aj. Likewise, every sensitive word bk in the sensitive word bank is converted into a word vector, denoted Bk. The similarity between a feature word and a sensitive word is quantified as the similarity between their word vectors, i.e. the cosine of the angle between the two vectors in the inner product space. Its value range is [0, 1], and the closer it is to 1, the more similar the two words are:
$$sim(A_j, B_k) = \frac{A_j \cdot B_k}{\lVert A_j \rVert \, \lVert B_k \rVert} \tag{2}$$
where $sim(A_j, B_k)$ denotes the similarity between Aj and Bk.
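The cosine similarity of step 2.2 can be written in a few lines. This sketch takes plain lists of floats, standing in for word2vec vectors. Note that for arbitrary vectors the cosine lies in [-1, 1]; the source states a range of [0, 1] and does not say how negative values would be treated, so the sketch returns the raw cosine.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.
    Returns 0.0 if either vector has zero length (an assumed convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```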
2.3 Sensitivity calculation for feature words:
In the sensitive word bank, each sensitive word $c_k$ (with word vector $B_k$) corresponds to a sensitivity level $L_k$; the larger $L_k$ is, the more sensitive the word. Traversing the sensitive word bank, the maximum sensitivity $Sen_{max}(a_j)$ of a feature word $a_j$ is defined as
$$Sen_{max}(a_j) = \max_{c_k \in C} \left( sim(A_j, B_k) \cdot L_k \right) \tag{3}$$
where $C$ denotes the sensitive word bank: the product of the similarity between each sensitive word vector and the feature word vector and the corresponding sensitivity level is computed, and the maximum is taken as the maximum sensitivity of the feature word.
A threshold parameter θ is set. When $Sen_{max}(a_j)$ is greater than θ, the feature word $a_j$ is sensitive; otherwise it is not regarded as a sensitive word and its sensitivity is recorded as 0. That is, the sensitivity of the feature word is
$$Sen(a_j) = \begin{cases} Sen_{max}(a_j), & Sen_{max}(a_j) > \theta \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
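Step 2.3 amounts to a maximum over the sensitive word bank followed by a threshold. The sketch below assumes the bank is a list of `(vector, level)` pairs; that layout, the function names, and the sample levels in the usage note are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def max_sensitivity(feature_vec, bank):
    """Largest similarity-times-level product over the sensitive word bank.
    bank: list of (sensitive_word_vector, sensitivity_level)."""
    return max((cosine_similarity(feature_vec, vec) * level
                for vec, level in bank), default=0.0)

def word_sensitivity(feature_vec, bank, theta):
    """Keep the maximum sensitivity only if it exceeds the threshold theta."""
    s = max_sensitivity(feature_vec, bank)
    return s if s > theta else 0.0
```

For a bank with one level-3 word along the first axis and one level-1 word along the second, a feature vector identical to the level-3 word scores 3.0, and one identical to the level-1 word is zeroed out when θ = 2.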
2.4 Sensitivity calculation for annotation content:
The sensitivity $Sen(s_i)$ of the i-th annotation $s_i$ is defined as:
$$Sen(s_i) = \sum_{j=1}^{m} \frac{j}{\sum_{k=1}^{m} k} \, Sen(a_j) \tag{5}$$
where annotation $s_i$ contains m feature words and j indicates that $a_j$ is the j-th feature word of $s_i$; the sensitivity of $s_i$ is the accumulated sum of the sensitivities of its m feature words, each weighted by its position. After the sensitivities of the feature words and the annotation content are obtained, the annotation content is divided by sensitivity into 4 levels: high sensitivity, medium sensitivity, low sensitivity, and non-sensitive.
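A rough sketch of step 2.4, assuming equation (5) weights the j-th feature word by j / (1 + 2 + … + m), as the detailed description's reference to word position and to the cumulative sum of 1, 2, …, m suggests. The cut-off values in `grade` are purely hypothetical; the source names four levels but gives no boundaries.

```python
def annotation_sensitivity(word_sens):
    """Position-weighted sum of feature-word sensitivities.
    word_sens[j-1] is the sensitivity of the j-th feature word."""
    m = len(word_sens)
    if m == 0:
        return 0.0
    rank_total = m * (m + 1) // 2          # 1 + 2 + ... + m
    return sum(j * s for j, s in enumerate(word_sens, start=1)) / rank_total

def grade(sensitivity, low_cut=1.0, high_cut=2.0):
    """Map a sensitivity to one of four levels (cut-offs are assumptions)."""
    if sensitivity <= 0:
        return "non-sensitive"
    if sensitivity < low_cut:
        return "low"
    if sensitivity < high_cut:
        return "medium"
    return "high"
```

Under this weighting a sensitive word at the end of a three-word annotation counts three times as much as the same word at the start.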
In a further improvement, step three comprises the following steps:
3.1 Construct a white list of geographic annotation information; each time non-sensitive data wrongly identified by the algorithm is found manually, add it to the white list, thereby improving the fault tolerance rate.
3.2 After the sensitive data has been screened against the white list, annotation content at the high sensitivity level is deleted directly. For annotations at the medium and low sensitivity levels, the sensitive feature words in the annotation are extracted and one desensitization means among deletion, replacement, and generalization is selected at random to process them; the sensitivity of the desensitized annotation is then recalculated, and if it still does not meet the requirement for publication after a preset number of iterations, the annotation is deleted entirely. When replacement is selected, the non-sensitive word with the greatest similarity to the current sensitive feature word is used as the substitute; when generalization is selected, the specific content described by the annotation is abstracted so that the description covers more non-sensitive information.
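The control flow of steps 3.1 and 3.2 might look roughly like the sketch below. Everything here is illustrative: `word_sens` and `grade` are caller-supplied callables, the `substitute` and `generalize` lookup tables stand in for the similarity-based replacement and the abstraction step, and processing the most sensitive word first is an assumption.

```python
import random

def desensitize(words, word_sens, grade, whitelist, substitute, generalize,
                max_rounds=3, rng=None):
    """Return the desensitized feature words, or None if the annotation is deleted.
    grade(words) must return "high", "medium", "low" or "none"."""
    rng = rng or random.Random()
    if " ".join(words) in whitelist:
        return list(words)                 # 3.1: manually confirmed non-sensitive
    if grade(words) == "high":
        return None                        # 3.2: high level is deleted directly
    words = list(words)
    for _ in range(max_rounds):
        if not words:
            return None
        if grade(words) == "none":
            return words                   # fit for publication
        # Work on the most sensitive feature word (assumed selection rule).
        idx = max(range(len(words)), key=lambda i: word_sens(words[i]))
        action = rng.choice(("delete", "replace", "generalize"))
        if action == "delete":
            new = ""
        elif action == "replace":
            new = substitute.get(words[idx], "")
        else:
            new = generalize.get(words[idx], "")
        words[idx:idx + 1] = [new] if new else []
    # Still sensitive after the preset number of rounds: delete the annotation.
    return words if words and grade(words) == "none" else None
```

Passing a seeded `random.Random` makes a run repeatable, which helps when reviewing which desensitization means was applied.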
The advantages of the invention are:
The invention uses the position information of map annotations to automate the traversal and integration of annotation content in geospatial data and the desensitization of sensitive information, overcoming the tedium, low efficiency, and proneness to errors and omissions of existing manual processing.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
Fig. 1 shows a KDTree-based image database data processing method, which includes the following steps:
(1) traversing and integrating map annotation information based on a KDTree;
(2) detecting sensitive information based on word similarity;
(3) desensitizing based on the sensitivity of the annotations.
The specific contents are as follows:
(1) A KDTree-based method for traversing and integrating map annotation information:
The method exploits the correlation of map annotation information in position distribution. It treats every single character as a point and greedily assumes that, within a certain range, the point closest to the current single-character point satisfies the condition for the two to be integrated into a word; only when that condition is not satisfied are the other points in the range considered. The specific flow is as follows:
Step one: extract the position coordinates and character content of all single characters on the map from the attribute table, and construct a KDTree from the two-dimensional coordinates of each character. A KDTree is a data structure that partitions a k-dimensional data space and is commonly used for range search and nearest-neighbor search.
Step two: because people usually read from top to bottom and from left to right, all single characters are sorted in descending order of ordinate before the annotation content is integrated, so that subsequently visited single characters are appended in order to the field currently being integrated; this gives the preliminary queue Q. A flag array vis is created to record whether each single character has been processed, and is initialized to 0. Queue Q is traversed until it is empty.
Step three: if the current point p has been processed, i.e. vis[p] = 1, skip to the next point in the queue. If vis[p] = 0, search the KDTree for the points whose distance from the current single-character point p lies within the threshold range [0, ε] to obtain the neighbor node set of p; in order of increasing distance from p, search the neighbor node set for a single-character point q that satisfies the integration condition with p, and if one is found, replace the current single-character point with q and set vis[q] to 1.
Step four: repeat step three until the neighbor node set contains no single-character point that can be integrated with the current single-character point, take the integrated single-character points as one annotation, and process the next unprocessed single character in the order of the preliminary queue Q.
Step five: repeat steps three and four until all single characters in the preliminary queue Q have been processed, obtaining all annotations on the map.
In addition, when an unprocessed point is integrated into an already processed field, all characters in the resulting new field s must lie on the same straight line and the distances between every two adjacent characters must be close to one another, i.e. the position distribution on the map must satisfy the condition for forming one annotation; otherwise the integration is not performed. During integration, when the ordinate is almost unchanged while the abscissa changes noticeably, characters with a smaller abscissa must be placed at the front of the annotation content, i.e. the characters are arranged from left to right.
After queue Q is empty, all single characters have been processed, and the annotation set obtained after integration is denoted S = {s1, s2, …, sn}.
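The range search that step three relies on can be illustrated with a small two-dimensional KD-tree in plain Python. This is a teaching sketch with hypothetical function names; a production system would more likely use an existing spatial index library, and the source does not say which implementation the inventors used.

```python
import math

def build(points, depth=0):
    """points: list of (x, y, payload). Returns (point, left, right) or None.
    Splits alternately on x (even depth) and y (odd depth) at the median."""
    if not points:
        return None
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (pts[mid],
            build(pts[:mid], depth + 1),
            build(pts[mid + 1:], depth + 1))

def radius_search(node, center, eps, depth=0, out=None):
    """Collect (distance, point) for every point within eps of center."""
    if out is None:
        out = []
    if node is None:
        return out
    point, left, right = node
    axis = depth % 2
    d = math.hypot(point[0] - center[0], point[1] - center[1])
    if d <= eps:
        out.append((d, point))
    diff = center[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    radius_search(near, center, eps, depth + 1, out)
    if abs(diff) <= eps:                   # the far side can still hold matches
        radius_search(far, center, eps, depth + 1, out)
    return out
```

Sorting the returned pairs by distance gives the near-to-far order in which candidate points q are examined; the query point itself comes back at distance 0 and should be skipped by the caller.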
(2) A sensitive information detection method based on word similarity:
To protect data security, sensitive information detection based on word similarity is performed on the annotation set S. It mainly comprises the following four steps:
Step one: segment the annotation content using Chinese word segmentation:
First, for each annotation si of the annotation set S, Chinese word segmentation and word vector construction are used to convert it into several word vectors. The invention uses jieba word segmentation to split a map annotation into several words, giving si = {a1, a2, …, am}, where a1, …, am are the m feature words obtained.
Step two: convert the feature words and sensitive words into word vectors using word vector construction:
The invention uses word2vec to convert the feature words into word vectors; the word vector of each feature word aj is denoted Aj. Likewise, every sensitive word bk in the sensitive word bank is converted into a word vector, denoted Bk. The similarity between a feature word and a sensitive word can then be quantified as the similarity between their word vectors, i.e. the cosine of the angle between the two vectors in the inner product space is taken as the similarity $sim(A_j, B_k)$. Its value range is [0, 1], and the closer the similarity is to 1, the more similar the two words are.
$$sim(A_j, B_k) = \frac{A_j \cdot B_k}{\lVert A_j \rVert \, \lVert B_k \rVert}$$
Step three: Sensitivity calculation of the feature words:
In the sensitive word bank, each sensitive word $c_k$ (with word vector $B_k$) corresponds to a sensitivity level $L_k$. The larger the value of L, the higher the sensitivity of the sensitive word. Traversing the sensitive word bank, the maximum sensitivity of a feature word $a_j$ is defined as
$$Sen_{max}(a_j) = \max_{c_k \in C} \left( sim(A_j, B_k) \cdot L_k \right)$$
where $C$ denotes the sensitive word bank: the product of the similarity between each sensitive word vector and the feature word vector and the corresponding sensitivity level is computed, and the maximum is taken as the maximum sensitivity of the feature word $a_j$.
A threshold parameter θ is set. When $Sen_{max}(a_j)$ is greater than θ, the feature word $a_j$ is sensitive; otherwise it is not regarded as a sensitive word and its sensitivity is recorded as 0. That is, the sensitivity of the feature word $a_j$ is
$$Sen(a_j) = \begin{cases} Sen_{max}(a_j), & Sen_{max}(a_j) > \theta \\ 0, & \text{otherwise} \end{cases}$$
Step four: Sensitivity calculation of the annotation content:
Since an annotation can be composed of several feature words, the sensitivities of all its feature words must be considered when measuring the sensitivity of the annotation. In addition, the feature words near the tail of the annotation content often have a greater influence on the sensitivity of the annotation. For example, in "the armed police hospital in Hunan province", although "armed police" is a sensitive word, the hospital is open to the public and does not count as sensitive annotation content. Likewise, in annotations such as "nuclear power plant" or "military base", the feature word at the end determines the sensitivity of the annotation. Therefore, taking the position distribution of the feature words into account, the sensitivity of an annotation is defined as:
$$Sen(s_i) = \sum_{j=1}^{m} \frac{j}{\sum_{k=1}^{m} k} \, Sen(a_j)$$
where annotation $s_i$ contains m feature words, j indicates that $a_j$ is the j-th feature word of the annotation, $Sen(a_j)$ is its sensitivity, and $\sum_{k=1}^{m} k$ is the cumulative sum of 1, 2, …, m. After the sensitivities of the feature words and the annotation content are obtained, the annotation content can be divided by sensitivity into 4 levels: high, medium, low, and non-sensitive.
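A short worked example may make the effect of word position concrete. It assumes the annotation sensitivity weights the j-th feature word by j divided by the cumulative sum 1 + 2 + … + m, and the feature-word sensitivities used are made-up illustrative numbers, not values from the source. For m = 3 the cumulative sum is 6, so the same sensitive word contributes three times as much at the tail as at the head:

```latex
% Illustrative values only: one sensitive feature word with Sen = 3, m = 3.
\begin{align*}
\text{sensitive word last:}  \quad Sen(s_i) &= \tfrac{1}{6}\cdot 0 + \tfrac{2}{6}\cdot 0 + \tfrac{3}{6}\cdot 3 = 1.5 \\
\text{sensitive word first:} \quad Sen(s_i) &= \tfrac{1}{6}\cdot 3 + \tfrac{2}{6}\cdot 0 + \tfrac{3}{6}\cdot 0 = 0.5
\end{align*}
```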
(3) A desensitization method based on feature word sensitivity.
After sensitive data in the labeled content has been detected by method (2), it must be processed according to its sensitivity level.
Step one: some labeled content that is not sensitive may contain feature words similar to sensitive words and therefore be misidentified as sensitive by the algorithm. To handle this, a whitelist of geographic labeling information can be constructed: each time non-sensitive data misidentified by the algorithm is found manually, it is added to the whitelist, which continuously improves the fault tolerance of the model.
Step two: to preserve the usability of the map data, it is undesirable to delete all sensitive data. After the sensitive data identified by the algorithm has been screened against the whitelist, desensitization means can be applied to reduce its sensitivity to the non-sensitive level so that it meets the requirements for public release. Labeled content at the high sensitivity level readily threatens the security of geographic information once leaked, so it is deleted directly. For labels at the medium and low sensitivity levels, the sensitive feature words are extracted and processed with a randomly selected desensitization means (deletion, replacement, or generalization), and the sensitivity of the label is then recalculated; if the label still does not meet the release requirement after a set number of iterations, it is deleted.
During replacement, the non-sensitive word most similar to the current sensitive feature word is selected from the non-sensitive word bank as the substitute. Generalization abstracts the specific content of the label description so that the description covers more non-sensitive information; for example, 'Liberation Army logistics base' becomes 'warehouse' after the generalization and replacement operations.
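The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `desensitize`, the callbacks `score`, `is_sensitive`, `replace` and `generalize`, and the numeric cut-offs `high`, `public` and `max_iter` are all hypothetical, since the text does not fix them.

```python
import random
from typing import Callable, Iterable, List, Optional


def desensitize(
    text: str,
    words: List[str],
    score: Callable[[List[str]], float],   # hypothetical: sensitivity of a word list
    is_sensitive: Callable[[str], bool],   # hypothetical: is this feature word sensitive?
    replace: Callable[[str], str],         # hypothetical: most similar non-sensitive word
    generalize: Callable[[str], str],      # hypothetical: more abstract description
    whitelist: Iterable[str] = (),
    high: float = 2.0,      # assumed cut-off for the "high" level
    public: float = 0.5,    # assumed cut-off below which a label may be published
    max_iter: int = 5,      # assumed iteration budget
    rng: Optional[random.Random] = None,
) -> Optional[List[str]]:
    """Return the desensitized feature words, or None if the label is deleted."""
    rng = rng or random.Random()
    # Step one: labels on the whitelist were misidentified and pass unchanged.
    if text in set(whitelist):
        return list(words)
    s = score(words)
    if s < public:
        return list(words)      # already fit for publication
    if s >= high:
        return None             # high level: delete directly
    # Step two: medium and low levels are processed iteratively.
    current = list(words)
    for _ in range(max_iter):
        processed = []
        for w in current:
            if not is_sensitive(w):
                processed.append(w)
                continue
            action = rng.choice(("delete", "replace", "generalize"))
            if action == "replace":
                processed.append(replace(w))
            elif action == "generalize":
                processed.append(generalize(w))
            # "delete": the word is simply dropped
        current = processed
        if score(current) < public:
            return current
    return None                 # still too sensitive after max_iter rounds
```

Passing a seeded `random.Random` makes a run reproducible, which helps when auditing why a particular label was altered or deleted.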

Claims (6)

1. A KDTree-based image database data processing method, characterized by comprising the following steps:
step one, traversing and integrating the map labeling information based on a KDTree to obtain a label set S = {s1, s2, …, sn} formed by n labels, where sn represents the n-th label;
the method specifically comprises the following steps:
1.1, extracting the position coordinates and text content of every single character on the map from the attribute table of the map to form a data set, and constructing a KDTree from the two-dimensional coordinates of each character;
1.2, sorting all single characters in descending order of ordinate to obtain a preliminary queue Q; creating a flag array vis[] that records whether each single character in the queue Q has been processed, initializing the flag array to 0, and traversing the queue Q until it is empty;
1.3, processing the single characters in the order in which they appear in the preliminary queue Q;
if the current single character point p has not been processed, namely vis[p] = 0, executing step 1.4 and setting vis[p] to 1;
if the current single character point p has been processed, namely vis[p] = 1, skipping to the next point in the queue Q;
1.4, searching the constructed KDTree for points whose distance from the current single character point p lies within the threshold range [0, ε] to obtain the neighbor node set of p, where ε is the integration range parameter, taken as 1.5 to 2 times the character width of the current single character point p; searching the neighbor node set, in order of increasing distance from p, for a single character point q that satisfies the integration condition with p; if such a point q is found, replacing the current single character point with q and setting vis[q] to 1; the single character point q is the point in the KDTree corresponding to the single character q;
1.5, repeating step 1.4 until the neighbor node set contains no single character point that can be integrated with the current single character point, and taking the integrated single characters as one label;
1.6, when the neighbor node set contains no single character point that can be integrated with the current single character point, processing the next unprocessed single character in the order of the preliminary queue Q;
1.7, repeating steps 1.3 to 1.6 until all single characters in the preliminary queue Q have been processed, thereby obtaining every label on the map;
step two, performing sensitive information detection based on word similarity on the label set S, and grading the sensitivity of the map labeling content;
step three, performing the corresponding desensitization treatment according to the sensitivity level of the map labeling content.
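Steps 1.1 to 1.7 of claim 1 can be sketched as below. This is a simplified illustration rather than the claimed implementation: it uses a small hand-written 2-D k-d tree instead of a library, a single `width` for all characters, a multiplier `k` for ε, and a caller-supplied predicate `can_merge` standing in for the integration condition. All of these names are hypothetical. Ties in the ordinate are broken by abscissa, which is also an assumption, so that a horizontal label is started from its leftmost character.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def dist(a: Point, b: Point) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


def build(points, idxs=None, depth=0):
    """Step 1.1: build a 2-D k-d tree as nested tuples (index, axis, left, right)."""
    if idxs is None:
        idxs = list(range(len(points)))
    if not idxs:
        return None
    axis = depth % 2
    idxs = sorted(idxs, key=lambda i: points[i][axis])
    mid = len(idxs) // 2
    return (idxs[mid], axis,
            build(points, idxs[:mid], depth + 1),
            build(points, idxs[mid + 1:], depth + 1))


def within(node, points, center, eps, out):
    """Collect indices of all points at distance <= eps from center."""
    if node is None:
        return
    i, axis, left, right = node
    if dist(points[i], center) <= eps:
        out.append(i)
    diff = center[axis] - points[i][axis]
    if diff <= eps:      # the left subtree may still hold points in range
        within(left, points, center, eps, out)
    if diff >= -eps:     # the right subtree may still hold points in range
        within(right, points, center, eps, out)


def integrate(chars: List[Tuple[float, float, str]], width: float, k: float = 1.5,
              can_merge: Callable[[List[int], int], bool] = lambda chain, q: True) -> List[str]:
    """chars holds (x, y, text) per single character; returns the labels."""
    pts = [(x, y) for x, y, _ in chars]
    tree = build(pts)
    # Step 1.2: descending ordinate (ties broken by abscissa, an assumption).
    queue = sorted(range(len(chars)), key=lambda i: (-pts[i][1], pts[i][0]))
    vis = [0] * len(chars)
    labels = []
    for p in queue:                      # step 1.3
        if vis[p]:
            continue
        vis[p] = 1
        chain, cur = [p], p
        while True:                      # steps 1.4 and 1.5
            near: List[int] = []
            within(tree, pts, pts[cur], k * width, near)
            near.sort(key=lambda i: dist(pts[i], pts[cur]))
            q = next((i for i in near if not vis[i] and can_merge(chain, i)), None)
            if q is None:
                break                    # step 1.6: move on to the next character
            vis[q] = 1
            chain.append(q)
            cur = q
        chain.sort(key=lambda i: pts[i][0])   # horizontal label: left to right
        labels.append("".join(chars[i][2] for i in chain))
    return labels
```

For example, five characters forming two horizontal rows one character width apart are grouped into two labels regardless of their order in the attribute table.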
2. The KDTree-based image database data processing method according to claim 1, wherein in step 1.4, the integration condition is as follows:
the single character point q has not been processed, namely vis[q] = 0;
in the first case, when the integrated field contains only the current single character point p, p is integrated with the single character point q in the neighbor node set that is closest to p; when the neighbor node set contains only the single character point p, p forms a label by itself;
in the second case, when the integrated field contains two or more characters, that is, when a field formed by a plurality of characters is to be integrated with the single character point q, judging whether all the characters in the new field s formed by the single character point q and the integrated field lie on the same straight line, and whether the range R of the array formed by the distances between every two adjacent characters satisfies the following condition:
$$R = \max(d_{i,j}) - \min(d_{i,j}) \le \theta, \qquad j = i + 1,\ 1 \le i < Len \tag{1}$$
wherein Len represents the number of single characters contained in the new field s, $d_{i,j}$ represents the distance between the i-th single character point $s_i$ and the j-th single character point $s_j$ in the new field s, and $\max$ and $\min$ respectively denote taking the maximum value and the minimum value; θ is 0.2 to 0.5 times the character width in the new field s;
if both integration conditions are satisfied, the field formed by the plurality of characters is integrated with the single character point q; if at least one of them is not satisfied, the field is not integrated with the single character point q.
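The second case of the integration condition can be sketched as a small check on the candidate field. This is an illustration under assumptions: the function name `can_integrate`, the collinearity tolerance `line_tol` and the default `theta_factor` are hypothetical, because the claim does not say how strictly "the same straight line" is tested.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def can_integrate(points: List[Point], char_width: float,
                  theta_factor: float = 0.3, line_tol: float = 1e-6) -> bool:
    """points: character centres of the new field s (integrated field plus q)."""
    if len(points) < 3:
        return True   # first case: handled by the nearest-neighbour rule instead
    x0, y0 = points[0]
    # Direction of the line: from the first point to the point farthest from it.
    far = max(points, key=lambda p: math.hypot(p[0] - x0, p[1] - y0))
    dx, dy = far[0] - x0, far[1] - y0
    norm = math.hypot(dx, dy)
    if norm == 0:
        return False  # all points coincide, so no line is defined
    # Collinearity: perpendicular offset of every point from that line.
    for x, y in points:
        if abs((x - x0) * dy - (y - y0) * dx) / norm > line_tol:
            return False
    # Order the characters along the line and measure adjacent spacings.
    ordered = sorted(points, key=lambda p: (p[0] - x0) * dx + (p[1] - y0) * dy)
    gaps = [math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in zip(ordered, ordered[1:])]
    # Range R = max - min of the adjacent distances must not exceed theta.
    return max(gaps) - min(gaps) <= theta_factor * char_width
```

With a character width of 1, evenly spaced centres at x = 0, 1, 2 pass the check, while centres at x = 0, 1, 3 fail because the spacing range is 1, well above θ = 0.3.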
3. The KDTree-based image database data processing method according to claim 2, wherein before integration, interference from duplicate characters is first eliminated: if the character frames corresponding to p and q intersect and their text content is identical, p and q are duplicate characters, and q is deleted from the attribute table to achieve deduplication.
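The deduplication rule of claim 3 can be illustrated with a short sketch. It assumes that each character frame is an axis-aligned box `(xmin, ymin, xmax, ymax)` and that boxes sharing only an edge do not count as intersecting; both points are assumptions, as are the names `boxes_intersect` and `deduplicate`.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax), assumed layout


def boxes_intersect(a: Box, b: Box) -> bool:
    """True if the two boxes overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def deduplicate(chars: List[Tuple[str, Box]]) -> List[Tuple[str, Box]]:
    """Keep the first of any characters with equal text and intersecting frames."""
    kept: List[Tuple[str, Box]] = []
    for text, box in chars:
        if any(t == text and boxes_intersect(box, b) for t, b in kept):
            continue   # q duplicates an earlier p, so it is dropped
        kept.append((text, box))
    return kept
```

The pairwise scan is quadratic and is used only for brevity; in practice the candidates for each character could be narrowed with the same KDTree range search used for integration.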
4. The KDTree-based image database data processing method according to claim 1, wherein in step 1.7, for horizontally distributed labels, the single characters in a label are arranged in ascending order of abscissa, that is, from left to right.
5. The KDTree-based image database data processing method of claim 1, wherein step two comprises the steps of:
2.1: performing word segmentation on the labeled content using a Chinese word segmentation technique:
for each piece of labeled content si in the label set S, converting si into a plurality of word vectors using a Chinese word segmentation technique and a word vector construction technique; obtaining si = {a1, a2, …, am}, where a1, …, am are the m feature words obtained by segmentation;
2.2: converting the feature words and the sensitive words into word vectors using a word vector construction technique:
converting the feature words into word vectors using word2vec, the word vector of the j-th feature word $a_j$ being denoted $A_j$; similarly, every sensitive word $b_k$ in the sensitive word bank is converted into a word vector denoted $B_k$; the similarity between a feature word and a sensitive word is quantified as the similarity between the feature word vector and the sensitive word vector, namely the cosine of the angle between the two vectors in the inner product space, with value range [0,1]; the closer the similarity is to 1, the more similar the two words are;
$$\mathrm{sim}(A_j, B_k) = \frac{A_j \cdot B_k}{\lVert A_j \rVert \, \lVert B_k \rVert}$$
where $\mathrm{sim}(A_j, B_k)$ represents the similarity between $A_j$ and $B_k$;
2.3: calculating the sensitivity of the feature words:
in the sensitive word bank, each sensitive word $b_k$ corresponds to a sensitivity level $l_k$; the larger the value of $l_k$, the more sensitive the word; traversing the sensitive word bank, the maximum sensitivity $M(a_j)$ of the feature word $a_j$ is defined as
$$M(a_j) = \max_{b_k \in D} \bigl( \mathrm{sim}(A_j, B_k) \cdot l_k \bigr)$$
wherein $D$ represents the sensitive word bank and $\mathrm{sim}(A_j, B_k)$ is the similarity between the vector $A_j$ of the feature word and the vector $B_k$ of the sensitive word; that is, the product of the similarity and the sensitivity level is calculated for each sensitive word vector, and the maximum value is taken as the maximum sensitivity of the feature word;
setting a threshold parameter θ; when $M(a_j)$ is larger than θ, the feature word $a_j$ is sensitive; otherwise $a_j$ is not regarded as a sensitive word and its sensitivity is recorded as 0; that is, the sensitivity of the feature word is
$$\mathrm{sen}(a_j) = \begin{cases} M(a_j), & M(a_j) > \theta \\ 0, & \text{otherwise} \end{cases}$$
2.4: calculating the sensitivity of the labeled content:
the sensitivity $\mathrm{Sen}(s_i)$ of the i-th label $s_i$ is defined as:
$$\mathrm{Sen}(s_i) = \sum_{j=1}^{m} \mathrm{sen}(a_j)$$
in the formula, the label $s_i$ contains m feature words, $a_j$ denotes the j-th feature word of the label $s_i$, and $\mathrm{sen}(a_j)$ is its sensitivity; that is, the sensitivity of the label $s_i$ is the accumulated sum of the sensitivities of the m feature words it contains; after the sensitivities of the feature words and of the labeled contents are obtained, the labeled contents are divided into four levels according to sensitivity: high, medium, low, and non-sensitive.
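Steps 2.2 to 2.4 can be sketched numerically as follows. The sketch assumes the word vectors are already available as plain lists of floats (in the method they would come from word2vec), represents the sensitive word bank as a list of `(vector, level)` pairs, and uses hypothetical cut-offs in `grade`, since the claim does not state where the four levels are separated. All function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Lexicon = List[Tuple[Vector, float]]   # (sensitive word vector, sensitivity level)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def word_sensitivity(vec: Vector, lexicon: Lexicon, theta: float) -> float:
    """Step 2.3: maximum of similarity times level, zeroed unless it exceeds theta."""
    best = max((cosine(vec, bvec) * level for bvec, level in lexicon), default=0.0)
    return best if best > theta else 0.0


def label_sensitivity(word_vecs: List[Vector], lexicon: Lexicon, theta: float) -> float:
    """Step 2.4: sum of the sensitivities of the m feature words of one label."""
    return sum(word_sensitivity(v, lexicon, theta) for v in word_vecs)


def grade(s: float, low_cut: float = 1.0, high_cut: float = 2.0) -> str:
    """Map a label sensitivity to one of four levels (cut-offs are assumptions)."""
    if s <= 0:
        return "non-sensitive"
    if s < low_cut:
        return "low"
    if s < high_cut:
        return "medium"
    return "high"
```

Taking the maximum over the sensitive word bank means a feature word is scored by its single closest and most sensitive match, so one strong match is enough to flag it.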
6. The KDTree-based image database data processing method of claim 1, wherein step three comprises the steps of:
3.1, constructing a whitelist of geographic labeling information, and adding non-sensitive data to the whitelist each time non-sensitive data misidentified by the algorithm is found manually, so that the fault tolerance rate is improved;
3.2, after the sensitive data has been screened against the whitelist, directly deleting the labeled content at the high sensitivity level; for labels at the medium and low sensitivity levels, extracting the sensitive feature words in the labels and processing them with a randomly selected desensitization means among deletion, replacement and generalization, then recalculating the sensitivity of the desensitized label, and deleting the label entirely if it still does not meet the release requirement after a preset number of iterations; when replacement is selected, the non-sensitive word with the greatest similarity to the current sensitive feature word is chosen as the replacement; when generalization is selected, the specific content of the label description is abstracted so that the description covers more non-sensitive information.
CN202110139298.4A 2021-02-02 2021-02-02 KDTree-based image database data processing method Active CN112463804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110139298.4A CN112463804B (en) 2021-02-02 2021-02-02 KDTree-based image database data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110139298.4A CN112463804B (en) 2021-02-02 2021-02-02 KDTree-based image database data processing method

Publications (2)

Publication Number Publication Date
CN112463804A (en) 2021-03-09
CN112463804B (en) 2021-06-15

Family

ID=74802248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110139298.4A Active CN112463804B (en) 2021-02-02 2021-02-02 KDTree-based image database data processing method

Country Status (1)

Country Link
CN (1) CN112463804B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115774769A (en) * 2022-11-17 2023-03-10 北京中知智慧科技有限公司 Sensitive word checking processing method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE69725677D1 (en) * 1996-10-25 2003-11-27 Navigation Tech Corp Device and method for storing geographic data on a physical storage medium
US20150339848A1 (en) * 2014-05-20 2015-11-26 Here Global B.V. Method and apparatus for generating a composite indexable linear data structure to permit selection of map elements based on linear elements
CN106874415A (en) * 2017-01-23 2017-06-20 国网山东省电力公司电力科学研究院 Environmental sensitive area database construction method and server based on generalized information system
US20170188067A1 (en) * 2015-12-28 2017-06-29 The Nielsen Company (Us), Llc Methods and apparatus to perform identity matching across audience measurement systems
CN108257119A (en) * 2018-01-08 2018-07-06 浙江大学 A kind of immediate offshore area floating harmful influence detection method for early warning based near ultraviolet image procossing
CN109257385A (en) * 2018-11-16 2019-01-22 重庆邮电大学 A kind of location privacy protection strategy based on difference privacy
CN109446288A (en) * 2018-10-18 2019-03-08 重庆邮电大学 One kind being based on the internet Spark concerning security matters map detection algorithm

Also Published As

Publication number Publication date
CN112463804B (en) 2021-06-15

Similar Documents

Publication Publication Date Title
EP2812883B1 (en) System and method for semantically annotating images
KR101377389B1 (en) Visual and multi-dimensional search
CN110399515B (en) Picture retrieval method, device and system
CN112256939B (en) Text entity relation extraction method for chemical field
WO2023108980A1 (en) Information push method and device based on text adversarial sample
MX2008013657A (en) Annotation by search.
JP2006508446A (en) Information storage and retrieval method
US11797705B1 (en) Generative adversarial network for named entity recognition
CN106407445B (en) A kind of unstructured data resource identification and localization method based on URL
US10831820B2 (en) Content based image management and selection
Manandhar et al. Learning structural similarity of user interface layouts using graph networks
CN111325030A (en) Text label construction method and device, computer equipment and storage medium
CN108491543A (en) Image search method, image storage method and image indexing system
Li et al. An automatic approach for generating rich, linked geo-metadata from historical map images
CN112035723A (en) Resource library determination method and device, storage medium and electronic device
CN112463804B (en) KDTree-based image database data processing method
US11163761B2 (en) Vector embedding models for relational tables with null or equivalent values
CA3012647A1 (en) Content based image management and selection
Liaqat et al. Applying uncertain frequent pattern mining to improve ranking of retrieved images
Larson et al. Ranking and representation for geographic information retrieval
Zhu et al. Using thesaurus to model keyblock-based image retrieval
Deniziak et al. World wide web CBIR searching using query by approximate shapes
Wu et al. Design of a Computer‐Based Legal Information Retrieval System
CN113392312A (en) Information processing method and system and electronic equipment
Sumathy et al. Image Retrieval and Analysis Using Text and Fuzzy Shape Features: Emerging Research and Opportunities: Emerging Research and Opportunities

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant