CN115618809A

CN115618809A - Character grouping method based on binary character frequency and safe word stock construction method

Info

Publication number: CN115618809A
Application number: CN202211416943.3A
Authority: CN
Inventors: 田辉; 朱鹏远; 郭玉刚; 张志翔
Original assignee: Hefei High Dimensional Data Technology Co ltd
Current assignee: Hefei High Dimensional Data Technology Co ltd
Priority date: 2022-11-14
Filing date: 2022-11-14
Publication date: 2023-01-17

Abstract

The invention relates to a character grouping method based on binary character frequency and a safe word stock construction method. The character grouping method comprises the following steps: traversing a corpus and counting the number of times any two of the N characters to be grouped occur one after the other, to obtain a binary character frequency matrix; traversing the characters one by one from high to low character frequency and calculating, according to a formula, the weight of assigning the character c to be allocated to the kth group; and adding the character c to the group with the largest weight, and so on until all characters are grouped. The binary character frequency matrix reflects how often two characters appear together, and the weight calculation formula gives a larger weight when two characters that frequently appear together are placed in different groups. Selecting the group with the largest weight therefore separates co-occurring characters into different groups as far as possible, which yields a reasonable grouping of the characters.

Description

Character grouping method based on binary character frequency and safe word stock construction method

Technical Field

The invention relates to the technical field of word stock invisible watermarks, in particular to a character grouping method and a safe word stock construction method based on binary character frequency.

Background

In existing text watermarking technology, text digital watermarking based on modifying the topological structure of characters has become mainstream, because it improves the robustness of the watermarking algorithm against attacks such as printing and scanning, screen capture, and photographing the screen. Specific characters are deformed in different ways, each deformation corresponds to a different watermark information bit string, and the character deformation data is stored in a specific watermark font library; the watermark information is then embedded by font replacement when an electronic text document is printed or displayed on screen. When different character deformation data is used for different users, the specific watermark word stock constitutes the safe word stock for that user.

Existing safe word stocks have many defects. To solve problems of the prior art such as poor universality of watermark loading, poor system stability, a complex implementation process and low robustness of the watermark algorithm, without changing any usage habit of the user, the patent "A universal text watermarking method and device" (publication number: CN 114708133A), filed by Beijing national crypto-technology Limited company, discloses the following scheme: a general text watermarking method comprising the steps of: grouping a certain number of characters in the selected word stock according to a specific strategy; performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file; generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal; dynamically generating and loading a watermark font library file in real time according to the watermark coding data, in combination with the watermark character data temporary file and the grouped characters; and running the text file in electronic format, and using the watermark font library file to embed watermark information in real time in the document content that is printed or displayed on screen.

This scheme requires characters to be grouped. In theory, characters with higher character frequency should be placed in different groups, and characters that often appear together should also be placed in different groups. A safe word stock that meets these two requirements needs less text content to extract the security code, so extraction is more effective and more accurate. The character grouping method in that scheme has several defects: first, the number of characters in each group is substantially equal, which conflicts with the above requirements; second, only character frequency is considered during grouping, while word frequency is not fully considered, although in theory the characters of frequently occurring words should be placed in different groups so that more groups appear in shorter content and less content is needed to extract the security code; third, the calculation process for optimizing the grouping is too complex and consumes a large amount of time and computing power.

Disclosure of Invention

The primary purpose of the invention is to provide a character grouping method based on binary character frequency and a safe word stock construction method, which can group characters more reasonably.

To achieve this purpose, the invention adopts the following technical scheme: a character grouping method based on binary character frequency, comprising the following steps: traversing the corpus and counting the number of times any two of the N characters to be grouped occur one after the other, to obtain a binary character frequency matrix, in which each element represents the frequency with which one character is followed by another character; traversing the characters one by one from high to low character frequency, and calculating the weight of assigning the character c to be allocated to the kth group according to a formula [formula not reproduced in the source text], wherein A is the set consisting of the already grouped characters and the character c to be allocated, and the two parameters of the formula are constants greater than 0 [the relation between them is not reproduced in the source text]; and adding the character c to be allocated to the group with the largest weight, and so on until all the characters are grouped.

Compared with the prior art, the invention has the following technical effects. The grouping scheme groups characters mainly according to the association between pairs of characters: two characters that often appear together are placed in different groups as far as possible. The binary character frequency matrix reflects how often two characters appear together, and the weight calculation formula gives a larger weight when two characters that often appear together are placed in different groups. Selecting the group with the largest weight therefore separates co-occurring characters into different groups as far as possible, which yields a reasonable grouping. In addition, this grouping mode places no limit on the number of characters in each group, which makes the grouping more reasonable still.

The second purpose of the invention is to provide a safe word stock construction method based on the above character grouping method, which improves the applicability and reliability of the safe word stock.

To achieve the above purpose, the first technical scheme adopted by the invention is as follows: a method for constructing a safe word stock, comprising the following steps: selecting the first N characters in order of character frequency, and performing deformation design on the N characters to obtain deformed characters, wherein the standard character and the deformed character of each character represent 0 and 1 respectively; dividing the N characters into K groups according to the steps of claim 1, each character belonging to only one group, where K is the number of bits of the binary string that encodes the security code represented by the safe word stock; and, for any security code, selecting the standard character or the deformed character of each character according to the binary digit corresponding to the group of that character, so that the selected standard or deformed characters of the N characters, together with the standard characters of the other, unselected characters, form the safe word stock corresponding to that security code.

To achieve the above purpose, the second technical scheme adopted by the invention is as follows: a method for constructing a safe word stock, comprising the following steps: selecting the first N characters in order of character frequency, and performing deformation design on each of the N characters to obtain deformed characters; binary-coding the standard character and the deformed characters of each character, wherein the bit number x of the binary code and the number of deformed characters of the character satisfy a formula [formula not reproduced in the source text]; dividing the N characters into K groups according to the steps of claim 1, wherein the number of groups to which each character belongs equals the bit number x of the binary code of that character, and K is the number of bits of the binary string that encodes the security code represented by the safe word stock; and, for any security code, using the binary digits corresponding to the groups of each character as the binary code that selects the standard character or a deformed character of that character, so that the selected standard or deformed characters of the N characters, together with the standard characters of the other, unselected characters, form the safe word stock corresponding to that security code.

Compared with the prior art, the two safe word stock construction methods have the following technical effects: because the character grouping is more reasonable, a safe word stock constructed on the basis of this character grouping method is more reliable. A safe word stock in which each character is placed in exactly one group has higher reliability, while a safe word stock in which a single character may be placed in several groups needs fewer characters on average to extract the security code and is therefore more widely applicable.

Drawings

FIG. 1 is a flow chart of a character grouping method of the present invention;

FIG. 2 is a flowchart of the safe word stock construction method according to the first embodiment of the present invention;

FIG. 3 is a flowchart of the safe word stock construction method according to the second embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to fig. 1 to 3.

Referring to fig. 1, the invention discloses a character grouping method based on binary character frequency, comprising the following steps: traversing the corpus and counting the number of times any two of the N characters to be grouped occur one after the other, to obtain a binary character frequency matrix, in which each element represents the frequency with which one character is followed by another character. The N characters to be grouped are generally the first N characters in descending order of character frequency, and N is generally 1000 to 3000. The binary character frequency reflects how often one character is followed by another; whether the two characters belong to the same word is not considered, only their positional relationship. For example, take the sentence "Hefei High Dimensional Data Technology Co., Ltd. is a new type of network security company, founded in May 2014 by senior professors and alumni of the School of Cyber Security of the University of Science and Technology of China", whose Chinese original begins "合肥高维…": 1 is recorded for the frequency of the character "合" followed by the character "肥", 1 is recorded for the frequency of the character "肥" followed by the character "高", and so on; even though "肥高" is not a word, it must still be counted. In the actual statistics, symbols can be ignored, in which case, for "司，由" ("company, by"), the frequency of the character "司" followed by "由" is increased by 1. Symbols may also be taken into account, in which case, for "司，由", the character "司" is followed by a comma and the character "由" is preceded by a comma, so the frequency of "司" followed by "由" is not increased.
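The counting step described above can be sketched in Python. This is only an illustrative sketch, not the patented implementation: the function names `is_symbol` and `count_bigrams`, the use of Unicode general categories to decide what counts as a "symbol", and the sparse `Counter` representation of the matrix are all assumptions made for the example.

```python
import unicodedata
from collections import Counter


def is_symbol(ch):
    # Assumption: "symbols" are Unicode punctuation (P*), symbols (S*)
    # and separators (Z*); the patent does not define the set precisely.
    return unicodedata.category(ch)[0] in "PSZ"


def count_bigrams(text, chars, ignore_symbols=True):
    """Count how often one character in `chars` is directly followed by another.

    Returns a Counter keyed by (first_char, second_char); it plays the role
    of a sparse binary character frequency matrix.
    """
    keep = set(chars)
    if ignore_symbols:
        # Dropping symbols makes the characters on either side adjacent.
        text = "".join(ch for ch in text if not is_symbol(ch))
    counts = Counter()
    for a, b in zip(text, text[1:]):
        if a in keep and b in keep:
            counts[(a, b)] += 1
    return counts


sample = "合肥高维数据技术有限公司，由"
to_group = set(sample) - {"，"}
ignoring = count_bigrams(sample, to_group, ignore_symbols=True)
keeping = count_bigrams(sample, to_group, ignore_symbols=False)
```

With symbols ignored, the pair ("司", "由") is counted once; with symbols kept, the comma separates the two characters and the pair is not counted.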

The corpus can be selected according to the needs of the user: either a general corpus or the internal corpus of a particular enterprise or organization may be used. Different corpora yield different character groups.

The characters are traversed one by one from high to low character frequency, and the weight of assigning the character c to be allocated to the kth group is calculated according to a formula [formula not reproduced in the source text], wherein A is the set consisting of the already grouped characters and the character c to be allocated, and the two parameters of the formula are constants greater than 0 [the relation between them is not reproduced in the source text]. Analysis of the weight shows that, when the number of grouped characters is smaller than the number of groups, the weight is largest when the character to be allocated is assigned to an empty group; and when every group already contains characters, the weight is largest when two characters that often appear together are assigned to different groups. Both cases are the desired assignment, so adding the character c to the group with the largest weight places it in the best group, and so on until all characters have been grouped. Introducing the weight converts the grouping problem into concrete calculation and comparison, which makes the problem easier to solve.
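The greedy loop described above can be sketched as follows. Because the weight formula itself is not available in this text, the sketch takes the weight as a pluggable function and demonstrates it with a stand-in weight that is an assumption of this example only (the negated co-occurrence count between c and the members of the candidate group); it is not the patent's formula. The names `greedy_group` and `stand_in_weight` are likewise hypothetical.

```python
def stand_in_weight(c, k, groups, bigram):
    # ASSUMPTION: placeholder weight, NOT the formula of the patent.
    # An empty group scores 0 (the maximum); a group that already holds
    # characters frequently adjacent to c scores lower.
    return -sum(bigram.get((c, g), 0) + bigram.get((g, c), 0) for g in groups[k])


def greedy_group(chars_by_freq, bigram, num_groups, weight=stand_in_weight):
    """Assign characters, most frequent first, to the group with the largest weight."""
    groups = [[] for _ in range(num_groups)]
    for c in chars_by_freq:
        weights = [weight(c, k, groups, bigram) for k in range(num_groups)]
        # max() returns the first index on ties; tie handling is simplified here.
        best = max(range(num_groups), key=weights.__getitem__)
        groups[best].append(c)
    return groups


# Toy data: "a" and "b" co-occur often, "b" and "c" occasionally.
toy_bigram = {("a", "b"): 5, ("b", "c"): 1}
toy_groups = greedy_group(["a", "b", "c"], toy_bigram, num_groups=2)
```

In this toy run "a" and "b" end up in different groups because they co-occur most often, and "c" joins the group of "a", with which it never co-occurs.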

The grouping scheme groups characters mainly according to the association between pairs of characters: two characters that often appear together are placed in different groups as far as possible. The binary character frequency matrix reflects how often two characters appear together, and the weight calculation formula gives a larger weight when two characters that often appear together are placed in different groups. Selecting the group with the largest weight therefore separates co-occurring characters into different groups as far as possible, which yields a reasonable grouping. In addition, this grouping mode places no limit on the number of characters in each group, which makes the grouping more reasonable still.

Because frequencies are counted, some elements of the binary character frequency matrix may be very large, which is inconvenient for subsequent calculation. Preferably, therefore, before the step of traversing the characters one by one from high to low character frequency, the method further comprises: normalizing the binary character frequency matrix to obtain a normalized binary character frequency matrix. The weight of assigning the character c to be allocated to each group is then calculated from the normalized matrix according to a formula [formula not reproduced in the source text]. Calculating the weights from the normalized binary character frequency matrix requires much less computation and is more convenient.

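Further, the binary character frequency matrix is normalized by either of the following formulas. The original symbols are not reproduced in the source text; here $F_{ij}$ denotes the element of the original matrix (the frequency with which character $i$ is followed by character $j$) and $P_{ij}$ the corresponding element of the normalized matrix:

$$P_{ij} = \frac{F_{ij}}{\sum_{j'=1}^{N} F_{ij'}}$$

or

$$P_{ij} = \frac{F_{ij}}{\sum_{i'=1}^{N} F_{i'j}}$$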

The above two formulas represent two different normalization modes. With the first formula, each row of the normalized binary character frequency matrix sums to 1, which means that the probabilities of a given character being followed by each of the other characters sum to 1. With the second formula, each column of the normalized binary character frequency matrix sums to 1, which means that the probabilities of a given character being preceded by each of the other characters sum to 1; this can also be understood as the reverse process. The specific normalization mode can be selected according to actual needs, because the normalization operation does not affect the grouping result.
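The two normalization modes can be illustrated with a small sketch on a dense list-of-lists matrix. The function names and the handling of all-zero rows or columns (left as zeros) are assumptions of this example; the text does not say how such cases are treated.

```python
def normalize_rows(freq):
    """Scale each row to sum to 1: probability that a character is followed by each other."""
    result = []
    for row in freq:
        total = sum(row)
        # Assumption: a row with no counts is left as all zeros.
        result.append([v / total if total else 0.0 for v in row])
    return result


def normalize_cols(freq):
    """Scale each column to sum to 1: probability that a character is preceded by each other."""
    col_totals = [sum(col) for col in zip(*freq)]
    return [
        [v / t if t else 0.0 for v, t in zip(row, col_totals)]
        for row in freq
    ]


toy_freq = [[1, 3],
            [2, 2]]
by_row = normalize_rows(toy_freq)   # rows sum to 1
by_col = normalize_cols(toy_freq)   # columns sum to 1
```

For the toy matrix, row normalization gives rows [0.25, 0.75] and [0.5, 0.5], while column normalization divides by the column totals 3 and 5.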

During calculation, several groups may, with very low probability, tie for the maximum weight, so the invention adds the following decision logic to ensure that every character can be grouped reliably. The step of adding the character c to be allocated to the group with the maximum weight comprises the following steps: if only one group has the maximum weight, adding the character c to that group; if several groups have the maximum weight, selecting, among those groups, the group with the fewest characters; if only one group has the fewest characters, adding the character c to that group; and if several groups have the fewest characters, adding the character c to any one of them at random.
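The tie-breaking logic above can be sketched as a small selection function. The name `pick_group`, the exact-equality test for "same weight", and the optional `rng` parameter (used only to make the random choice controllable) are assumptions of this example.

```python
import random


def pick_group(weights, groups, rng=random):
    """Return the index of the group that should receive the next character.

    weights[k] is the weight of assigning the character to group k and
    groups[k] is the list of characters already in group k.
    """
    best = max(weights)
    # Assumption: weights are compared with exact equality.
    candidates = [k for k, w in enumerate(weights) if w == best]
    if len(candidates) > 1:
        # Several groups tie on weight: prefer the one with the fewest characters.
        fewest = min(len(groups[k]) for k in candidates)
        candidates = [k for k in candidates if len(groups[k]) == fewest]
    if len(candidates) == 1:
        return candidates[0]
    # Still tied: choose any of the remaining groups at random.
    return rng.choice(candidates)


unique_max = pick_group([1.0, 3.0, 2.0], [[], [], []])
tie_on_weight = pick_group([1.0, 2.0, 2.0], [[], ["a", "b"], ["c"]])
full_tie = pick_group([2.0, 2.0], [["a"], ["b"]])
```

The first call has a single maximum, the second is resolved by group size, and only the third falls through to the random choice.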

The two constants of the weight formula can take many values; the preferred values in the invention are [values not reproduced in the source text]. Such values make the weights easier to calculate and make it convenient to compare the sizes of the weights. How to further optimize the calculation of the weights is not the focus of the present application and is not described here.

The invention also discloses two safe word stock construction methods, which are as follows.

Referring to fig. 2, in the first embodiment, the method for constructing the safe word stock comprises the following steps: selecting the first N characters in order of character frequency, and performing deformation design on the N characters to obtain deformed characters, wherein the standard character and the deformed character of each character represent 0 and 1 respectively; dividing the N characters into K groups according to the steps of claim 1, each character belonging to only one group, where K is the number of bits of the binary string that encodes the security code represented by the safe word stock; and, for any security code, selecting the standard character or the deformed character of each character according to the binary digit corresponding to the group of that character, so that the selected standard or deformed characters of the N characters, together with the standard characters of the other, unselected characters, form the safe word stock corresponding to that security code.
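The selection step of the first embodiment can be sketched as follows. Glyphs are represented by plain strings, and the function name `build_word_stock`, the dictionary layout and the convention that bit k of the security code string belongs to group k are all assumptions made for illustration.

```python
def build_word_stock(security_code, group_of, standard, deformed):
    """Choose a standard or deformed glyph for every character.

    security_code: string of '0'/'1' with one bit per group (length K).
    group_of:      maps each of the N grouped characters to its group index.
    standard:      maps every character of the font to its standard glyph.
    deformed:      maps each of the N grouped characters to its deformed glyph.
    """
    word_stock = {}
    for ch, glyph in standard.items():
        k = group_of.get(ch)
        if k is not None and security_code[k] == "1":
            word_stock[ch] = deformed[ch]   # bit 1: use the deformed character
        else:
            word_stock[ch] = glyph          # bit 0 or ungrouped: standard character
    return word_stock


toy_stock = build_word_stock(
    security_code="10",
    group_of={"a": 0, "b": 1},
    standard={"a": "a-std", "b": "b-std", "z": "z-std"},
    deformed={"a": "a-def", "b": "b-def"},
)
```

Here "a" belongs to group 0, whose bit is 1, so its deformed glyph is used; "b" belongs to group 1, whose bit is 0, and "z" is not grouped, so both keep their standard glyphs.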

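Referring to fig. 3, in the second embodiment, the method for constructing the safe word stock comprises the following steps: selecting the first N characters in order of character frequency, and performing deformation design on each of the N characters to obtain deformed characters; binary-coding the standard character and the deformed characters of each character, wherein the bit number x of the binary code and the number of deformed characters of the character satisfy a formula [formula not reproduced in the source text]; dividing the N characters into K groups according to the steps of claim 1, wherein the number of groups to which each character belongs equals the bit number x of the binary code of that character, and K is the number of bits of the binary string that encodes the security code represented by the safe word stock; and, for any security code, using the binary digits corresponding to the groups of each character as the binary code that selects the standard character or a deformed character of that character, so that the selected standard or deformed characters of the N characters, together with the standard characters of the other, unselected characters, form the safe word stock corresponding to that security code.

The basic principle of the two safe word stock construction methods is the same; the only difference is that in the first embodiment each character is assigned to exactly one group, while in the second embodiment characters with high character frequency can be assigned to several groups. Because the character grouping is more reasonable, a safe word stock constructed on the basis of this character grouping method is more reliable. A safe word stock in which each character is placed in exactly one group has higher reliability, while a safe word stock in which a single character may be placed in several groups needs fewer characters on average to extract the security code and is therefore more widely applicable.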

The invention also discloses a computer-readable storage medium and an electronic device. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the character grouping method based on binary character frequency described above or the safe word stock construction method described above. The electronic device comprises a memory, a processor and a computer program stored on the memory, wherein the processor, when executing the computer program, implements the character grouping method based on binary character frequency or the safe word stock construction method.

Claims

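1. A character grouping method based on binary character frequency, characterized by comprising the following steps:

traversing the corpus, and counting the occurrence times of any two characters among the N characters to be grouped to obtain a binary character frequency matrix, wherein each element of the binary character frequency matrix represents the frequency with which one character is followed by another character;

traversing the characters one by one from high to low character frequency, and calculating the weight of assigning the character c to be allocated to the kth group according to a formula [formula not reproduced in the source text], wherein A is the set consisting of the already grouped characters and the character c to be allocated, and the two parameters of the formula are constants greater than 0 [the relation between them is not reproduced in the source text];

and adding the character c to be allocated to the group with the largest weight, and so on until all the characters are grouped.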

2. The character grouping method based on binary character frequency as claimed in claim 1, wherein: before the step of traversing the characters one by one from high to low character frequency, the method further comprises the following step: normalizing the binary character frequency matrix to obtain a normalized binary character frequency matrix.

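3. The character grouping method based on binary character frequency as claimed in claim 2, wherein: the binary character frequency matrix is normalized by either of the following formulas, in which (the original symbols not being reproduced in the source text) $F_{ij}$ denotes the element of the original matrix and $P_{ij}$ the corresponding element of the normalized matrix:

$$P_{ij} = \frac{F_{ij}}{\sum_{j'=1}^{N} F_{ij'}}$$

or

$$P_{ij} = \frac{F_{ij}}{\sum_{i'=1}^{N} F_{i'j}}$$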

4. The binary character frequency-based character grouping method as claimed in claim 1, wherein: the step of adding the character c to be assigned to the group with the maximum weight comprises the following steps:

if the group with the maximum weight is only one, adding the character c to be distributed into the group;

if several groups have the maximum weight, selecting, among all the groups with the maximum weight, the group with the fewest characters;

if only one group with the least number of characters exists, adding the character c to be distributed into the group;

if several groups have the fewest characters, the character c to be allocated is added to any one of them at random.

5. The character grouping method based on binary character frequency as claimed in claim 1, wherein: the two constants of the formula take specified values [values not reproduced in the source text].

6. A method for constructing a safe word stock, characterized by comprising the following steps:

selecting the first N characters in order of character frequency, and performing deformation design on the N characters to obtain deformed characters, wherein the standard character and the deformed character of each character represent 0 and 1 respectively;

dividing the N characters into K groups according to the steps of claim 1, each character belonging to only one group, K being the number of bits of the binary string that encodes the security code represented by the safe word stock;

for any security code, selecting the standard character or the deformed character of each character according to the binary digit corresponding to the group of that character, so that the selected standard or deformed characters of the N characters, together with the standard characters of the other, unselected characters, form the safe word stock corresponding to that security code.

7. A method for constructing a safe word stock, characterized by comprising the following steps:

selecting the first N characters in order of character frequency, and performing deformation design on each of the N characters to obtain deformed characters;

binary-coding the standard character and the deformed characters of each character, wherein the bit number x of the binary code and the number of deformed characters of the character satisfy a formula [formula not reproduced in the source text];

dividing the N characters into K groups according to the steps of claim 1, wherein the number of groups to which each character belongs equals the bit number x of the binary code of that character, and K is the number of bits of the binary string that encodes the security code represented by the safe word stock;

for any security code, using the binary digits corresponding to the groups of each character as the binary code that selects the standard character or a deformed character of that character, so that the selected standard or deformed characters of the N characters, together with the standard characters of the other, unselected characters, form the safe word stock corresponding to that security code.

8. A computer-readable storage medium, characterized in that: a computer program is stored thereon which, when executed by a processor, implements the character grouping method based on binary character frequency as claimed in any one of claims 1 to 5 or the safe word stock construction method as claimed in claim 6 or 7.

9. An electronic device, characterized in that: it comprises a memory, a processor and a computer program stored on the memory, the processor, when executing the computer program, implementing the character grouping method based on binary character frequency as claimed in any one of claims 1 to 5 or the safe word stock construction method as claimed in claim 6 or 7.