CN115618809A - Character grouping method based on binary character frequency and safe word stock construction method - Google Patents
- Publication number
- CN115618809A (application CN202211416943.3A)
- Authority
- CN
- China
- Prior art keywords
- character
- characters
- binary
- word
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/109—Font handling; Temporal or kinetic typography
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Optimization (AREA)
- Computational Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Algebra (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention relates to a character grouping method based on binary character frequency and to a secure word stock construction method. The character grouping method comprises the following steps: traversing a corpus and counting the co-occurrences of every pair of characters among the N characters to be grouped, to obtain a binary character frequency matrix; traversing the characters one by one in descending order of character frequency and, for each character c to be allocated, calculating by formula the weight of allocating c to the k-th group; and adding c to the group with the largest weight, and so on until all characters are grouped. The binary character frequency matrix reflects how often two characters appear together. The weight formula increases the weight of placing two frequently co-occurring characters in different groups, so selecting the group with the largest weight separates co-occurring characters into different groups as far as possible and yields a reasonable grouping.
Description
Technical Field
The invention relates to the technical field of invisible word stock watermarking, and in particular to a character grouping method based on binary character frequency and a secure word stock construction method.
Background
In existing text watermarking technology, text digital watermarking based on modifying the topological structure of characters has become mainstream, because it improves the robustness of the watermarking algorithm against attacks such as print-and-scan, screen capture, and photographing the screen. Specific characters are deformed in different ways, each deformation corresponds to a different watermark bit string, and the deformation data are stored in a dedicated watermark font library; the watermark is then embedded by font replacement when an electronic text document is printed or displayed on screen. When different users are given different character deformation data, the dedicated watermark word stock constitutes the secure word stock of that user.
Existing secure word stocks have many shortcomings. To address the poor universality of watermark loading, poor system stability, complex implementation, and low robustness of watermarking algorithms in the prior art, without changing any habit of the user, the patent "Universal text watermarking method and device" (publication number CN114708133A), filed by Beijing national crypto-technology Limited company, discloses the following scheme. A general text watermarking method comprises the steps of: grouping a certain number of characters in the selected word stock according to a specific strategy; designing deformations of all characters in each group according to a specific rule, and generating a temporary file of watermark character data; generating watermark coding data for the user terminal to identify its identity authentication information; dynamically generating and loading a watermark font library file in real time from the watermark coding data, the temporary file, and the grouped characters; and, when a text file in electronic format is opened, using the watermark font library file to embed the watermark in real time in the content that is printed or displayed on screen.
This scheme requires the characters to be grouped. In theory, characters with higher frequency should be placed in different groups, and characters that often appear together should also be placed in different groups. A secure word stock that meets these two requirements needs less text to extract the security code, so extraction is more effective and more accurate. The character grouping method in that scheme has several shortcomings. First, the number of characters in each group is substantially equal, which conflicts with the requirements above. Second, only character frequency is considered during grouping and word frequency is not fully considered; in theory, the characters of a frequently occurring word should be placed in different groups, so that more groups appear in a shorter passage and less text is needed to extract the security code. Third, the calculation used to optimize the grouping is too complex and consumes a large amount of time and computing power.
Disclosure of Invention
The invention mainly aims to provide a character grouping method and a safe character library construction method based on binary character frequency, which can more reasonably group characters.
To achieve this purpose, the invention adopts the following technical scheme. A character grouping method based on binary character frequency comprises the following steps: traversing a corpus and counting the co-occurrences of every pair of characters among the N characters to be grouped, to obtain a binary character frequency matrix, each element of which represents the frequency with which one character is followed by another; traversing the characters one by one in descending order of character frequency, and calculating the weight of allocating the character c to be allocated to the k-th group according to the following formula:
[formula not reproduced in this text]
wherein A is the set consisting of the grouped characters and the character c to be allocated, and the two coefficients of the formula are constants greater than 0 [their symbols and a further condition on them are not reproduced in this text]; and adding the character c to be allocated to the group with the largest weight, and so on until all characters are grouped.
Compared with the prior art, the invention has the following technical effects. The scheme groups characters mainly by the association between pairs of characters: two characters that often appear together are placed in different groups as far as possible. The binary character frequency matrix reflects how often two characters appear together, and the weight formula increases the weight of placing two frequently co-occurring characters in different groups, so selecting the group with the largest weight separates co-occurring characters as far as possible. The grouping does not constrain the number of characters in each group, which makes the character grouping more reasonable.
The second purpose of the present invention is to provide a method for constructing a secure word stock based on the above character grouping method, which improves the applicability and reliability of the secure word stock.
To achieve the above purpose, the first technical scheme adopted by the invention is as follows. A secure word stock construction method comprises the following steps: selecting the first N characters in descending order of character frequency and designing a deformation of each of the N characters to obtain deformed characters, the standard character and the deformed character of each character representing 0 and 1 respectively; dividing the N characters into K groups by the character grouping method described above, each character belonging to exactly one group, where K is the number of bits of the binary string that encodes the security code represented by the secure word stock; and, for any security code, selecting the standard character or the deformed character of each character according to the bit corresponding to the group of that character, the selected standard or deformed characters of the N characters together with the standard characters of all other characters forming the secure word stock corresponding to that security code.
To achieve the above purpose, the second technical scheme adopted by the invention is as follows. A secure word stock construction method comprises the following steps: selecting the first N characters in descending order of character frequency and designing deformations of each of the N characters to obtain deformed characters; binary-coding the standard character and the deformed characters of each character, the number of bits x of the binary code and the number of deformed characters of the character satisfying a formula [not reproduced in this text]; dividing the N characters into K groups by the character grouping method described above, the number of groups to which each character belongs being equal to the number of bits x of the binary code of that character, where K is the number of bits of the binary string that encodes the security code represented by the secure word stock; and, for any security code, using the bits corresponding to the groups of each character as the binary code that selects the standard character or a deformed character of that character, the selected standard or deformed characters of the N characters together with the standard characters of all other characters forming the secure word stock corresponding to that security code.
Compared with the prior art, the two secure word stock construction methods have the following technical effects. Because the character grouping is more reasonable, a secure word stock built on it is more reliable. A secure word stock in which each character belongs to a single group has higher reliability, while one in which a single character belongs to several groups needs fewer characters on average to extract the security code and is therefore more widely applicable.
Drawings
FIG. 1 is a flow chart of a character grouping method of the present invention;
FIG. 2 is a flowchart of the secure word stock construction method according to the first embodiment of the present invention;
FIG. 3 is a flowchart of the secure word stock construction method according to the second embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to fig. 1 to 3.
Referring to fig. 1, the invention discloses a character grouping method based on binary character frequency, comprising the following steps. First, a corpus is traversed and the co-occurrences of every pair of characters among the N characters to be grouped are counted, to obtain a binary character frequency matrix, each element of which represents the frequency with which one character is followed by another. The N characters to be grouped are generally the first N characters in descending order of character frequency, with N generally between 1000 and 3000. The binary character frequency reflects how often one character is followed by another; whether the two characters belong to the same word is not considered, only their positional relationship. For example, take the sentence "Hefei High-Dimensional Data Technology Company is a new type of network security company, founded in May 2014 by senior professors and alumni of the School of Cyber Security of the University of Science and Technology of China". In the Chinese original, the frequency of the character "合" being followed by "肥" is incremented by 1, the frequency of "肥" being followed by "高" is incremented by 1, and so on; the pair "肥高" must be counted even though it is not a word. Punctuation can be handled in either of two ways. If symbols are ignored, then for "司，由" the frequency of "司" being followed by "由" is incremented by 1. If symbols are considered, then "司" is followed by a comma and "由" is preceded by a comma, so the frequency of "司" being followed by "由" is not incremented.
The corpus can be chosen according to the needs of the user: it may be a general corpus or the internal corpus of a particular enterprise or organization. Different corpora yield different character groupings.
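The counting step can be illustrated with a minimal Python sketch. The function names, the short sample string, and the use of Unicode categories to decide what counts as a "symbol" are assumptions made for illustration and are not taken from the patent; the sparse `Counter` stands in for the binary character frequency matrix.

```python
import unicodedata
from collections import Counter


def is_symbol(ch):
    # Assumption: punctuation (P*), symbols (S*) and separators (Z*)
    # are what the text calls "symbols".
    return unicodedata.category(ch)[0] in "PSZ"


def count_bigrams(text, charset, ignore_symbols=True):
    """Count how often one character of `charset` is immediately followed by another."""
    if ignore_symbols:
        # Dropping symbols makes the characters on either side adjacent.
        text = "".join(ch for ch in text if not is_symbol(ch))
    counts = Counter()
    for a, b in zip(text, text[1:]):
        if a in charset and b in charset:
            counts[(a, b)] += 1
    return counts


# Illustrative sample only; it is not the sentence used in the patent.
sample = "合肥高维公司，由校友创建"
charset = {ch for ch in sample if not is_symbol(ch)}

ignoring = count_bigrams(sample, charset, ignore_symbols=True)
keeping = count_bigrams(sample, charset, ignore_symbols=False)
print(ignoring[("司", "由")], keeping[("司", "由")])
```

With symbols ignored, the pair "司" then "由" is counted across the comma; with symbols kept, the comma breaks the pair and the count stays at zero.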
The characters are then traversed one by one in descending order of character frequency, and the weight of allocating the character c to be allocated to the k-th group is calculated according to the following formula:
[formula not reproduced in this text]
wherein A is the set consisting of the grouped characters and the character c to be allocated, and the two coefficients of the formula are constants greater than 0 [their symbols and a further condition on them are not reproduced in this text]. Analysis of the weight shows two things. When the number of grouped characters is less than the number of groups, the weight is largest when the character to be allocated is placed in an empty group. When every group already contains characters, the weight is largest when two characters that often appear together are placed in different groups. Both cases are the desired allocation, so adding the character c to the group with the largest weight places it in the best group, and so on until all characters are grouped. Introducing the weight turns the grouping problem into concrete calculation and comparison, which makes the problem easier to solve.
The scheme groups characters mainly by the association between pairs of characters: two characters that often appear together are placed in different groups as far as possible. The binary character frequency matrix reflects how often two characters appear together, and the weight formula increases the weight of placing two frequently co-occurring characters in different groups, so selecting the group with the largest weight separates co-occurring characters as far as possible. The grouping does not constrain the number of characters in each group, which makes the character grouping more reasonable.
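The greedy allocation loop can be sketched in Python. The weight formula of the patent is not available in this text, so the `weight` function below is a hypothetical stand-in, not the formula of the patent: it sums the pair frequencies between c and every already-grouped character that lies outside group k, scaled by two positive constants `alpha` and `beta`. This stand-in at least reproduces the two behaviours described above (an empty group scores highest, and frequently co-occurring characters are pushed apart). All names are illustrative.

```python
def weight(c, k, group_of, bigram, alpha=1.0, beta=1.0):
    # Hypothetical stand-in weight: reward separating c from the
    # characters it co-occurs with, i.e. those NOT in group k.
    return sum(
        alpha * bigram.get((a, c), 0) + beta * bigram.get((c, a), 0)
        for a, g in group_of.items()
        if g != k
    )


def group_characters(chars_by_freq, bigram, num_groups):
    """Greedily place each character in the group with the largest weight."""
    groups = [[] for _ in range(num_groups)]
    group_of = {}
    for c in chars_by_freq:  # assumed sorted by descending character frequency
        weights = [weight(c, k, group_of, bigram) for k in range(num_groups)]
        # Ties: prefer the group with fewer characters, then the lowest index
        # (a deterministic simplification of the random tie-break in the text).
        best = max(range(num_groups), key=lambda k: (weights[k], -len(groups[k])))
        groups[best].append(c)
        group_of[c] = best
    return groups


bigram = {("a", "b"): 5, ("c", "d"): 3, ("a", "c"): 1}
groups = group_characters(["a", "b", "c", "d"], bigram, num_groups=2)
print(groups)
```

In this toy input the frequent pairs ("a", "b") and ("c", "d") end up split across the two groups, and the group sizes are not forced to be equal.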
Because the counted frequencies can make some elements of the binary character frequency matrix very large, which is inconvenient for subsequent calculation, the invention preferably includes the following step before the characters are traversed one by one in descending order of character frequency: normalizing the binary character frequency matrix to obtain a normalized binary character frequency matrix. The weight of allocating the character c to each group is then calculated according to a formula [not reproduced in this text] defined on the normalized matrix. Calculating the weights from the normalized matrix involves a much smaller amount of calculation and is more convenient.
Further, the binary character frequency matrix is normalized by either of two formulas [not reproduced in this text], which represent two different normalization modes. Under the first formula, each row of the normalized matrix sums to 1, meaning that the probabilities of a given character being followed by each of the other characters sum to 1. Under the second formula, each column of the normalized matrix sums to 1, meaning that the probabilities of a given character being preceded by each of the other characters sum to 1; this can be understood as the reverse process. Either normalization mode can be selected according to actual needs, because the normalization operation does not affect the grouping result.
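The two normalization modes can be sketched in Python as row normalization and column normalization. This is an interpretation of the description (each row, or each column, sums to 1), not a transcription of the formulas in the patent; the function names and the handling of all-zero rows or columns are assumptions.

```python
def normalize_rows(matrix):
    """Scale each row to sum to 1: P(next character | current character)."""
    result = []
    for row in matrix:
        total = sum(row)
        # Assumption: an all-zero row is left as zeros.
        result.append([v / total if total else 0.0 for v in row])
    return result


def normalize_columns(matrix):
    """Scale each column to sum to 1: P(previous character | current character)."""
    totals = [sum(col) for col in zip(*matrix)]
    return [
        [v / t if t else 0.0 for v, t in zip(row, totals)]
        for row in matrix
    ]


counts = [
    [0, 3, 1],
    [2, 0, 2],
    [0, 0, 0],
]
by_row = normalize_rows(counts)
by_col = normalize_columns(counts)
print(by_row[0], [row[1] for row in by_col])
```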
During calculation, several groups may share the largest weight, although with very low probability, so the invention adds the following decision logic to ensure that every character is reliably grouped. The step of adding the character c to be allocated to the group with the largest weight comprises: if only one group has the largest weight, adding c to that group; if several groups share the largest weight, selecting among them the group with the fewest characters; if only one group has the fewest characters, adding c to that group; and if several groups have the fewest characters, adding c to one of them at random.
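The decision logic above can be sketched as a small Python helper. The function name and the optional `rng` parameter are illustrative; the weights are assumed to have been computed already for the character being allocated.

```python
import random


def pick_group(weights, groups, rng=random):
    """Return the index of the group that the character should join."""
    top = max(weights)
    heaviest = [k for k, w in enumerate(weights) if w == top]
    if len(heaviest) == 1:
        return heaviest[0]  # a single group has the largest weight
    fewest = min(len(groups[k]) for k in heaviest)
    smallest = [k for k in heaviest if len(groups[k]) == fewest]
    if len(smallest) == 1:
        return smallest[0]  # a single tied group has the fewest characters
    return rng.choice(smallest)  # still tied: choose at random


print(pick_group([1, 3, 2], [[], [], []]))
print(pick_group([3, 3, 1], [["a", "b"], ["c"], []]))
```

Passing a seeded `random.Random` instance as `rng` makes the final random choice reproducible, which is convenient when comparing groupings across runs.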
The two constants can take many values, and the invention specifies preferred values [not reproduced in this text]. Such values make the differences between weights easier to calculate, so that the weights can be compared conveniently. How to further optimize the weight calculation is not the focus of the present application and is not described here.
The invention also discloses two safe word stock construction methods, which are as follows.
Referring to fig. 2, in the first embodiment, the secure word stock construction method comprises the following steps: selecting the first N characters in descending order of character frequency and designing a deformation of each of the N characters to obtain deformed characters, the standard character and the deformed character of each character representing 0 and 1 respectively; dividing the N characters into K groups by the character grouping method described above, each character belonging to exactly one group, where K is the number of bits of the binary string that encodes the security code represented by the secure word stock; and, for any security code, selecting the standard character or the deformed character of each character according to the bit corresponding to the group of that character, the selected standard or deformed characters of the N characters together with the standard characters of all other characters forming the secure word stock corresponding to that security code.
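The selection step of the first embodiment can be sketched in Python. Glyph data are represented by placeholder strings, and the function name, the dictionary layout, and the convention that bit k of the code string belongs to group k are assumptions made for illustration.

```python
def build_word_stock(security_code, group_of, standard, deformed):
    """Pick the standard (bit 0) or deformed (bit 1) form of each grouped character.

    security_code: string of '0'/'1' characters, one bit per group (length K)
    group_of:      maps each of the N grouped characters to its group index
    standard:      maps every character in the word stock to its standard form
    deformed:      maps each grouped character to its deformed form
    """
    stock = dict(standard)  # ungrouped characters keep their standard form
    for ch, k in group_of.items():
        if security_code[k] == "1":
            stock[ch] = deformed[ch]
    return stock


standard = {"a": "a-standard", "b": "b-standard", "z": "z-standard"}
deformed = {"a": "a-deformed", "b": "b-deformed"}
group_of = {"a": 0, "b": 1}  # K = 2 groups; "z" is not among the N characters

stock = build_word_stock("10", group_of, standard, deformed)
print(stock)
```

Because every character of a group carries the same bit, observing any one character from each group in a printed or displayed passage is enough to read back the whole code, which is why spreading co-occurring characters across groups shortens the passage needed.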
Referring to fig. 3, in the second embodiment, the secure word stock construction method comprises the following steps: selecting the first N characters in descending order of character frequency and designing deformations of each of the N characters to obtain deformed characters; binary-coding the standard character and the deformed characters of each character, the number of bits x of the binary code and the number of deformed characters of the character satisfying a formula [not reproduced in this text]; dividing the N characters into K groups by the character grouping method described above, the number of groups to which each character belongs being equal to the number of bits x of the binary code of that character, where K is the number of bits of the binary string that encodes the security code represented by the secure word stock; and, for any security code, using the bits corresponding to the groups of each character as the binary code that selects the standard character or a deformed character of that character, the selected standard or deformed characters of the N characters together with the standard characters of all other characters forming the secure word stock corresponding to that security code.
The two secure word stock construction schemes share the same basic principle and differ in only one respect: in the first embodiment each character is assigned to exactly one group, whereas in the second embodiment a character with high frequency can be assigned to several groups. Because the character grouping is more reasonable, a secure word stock built on it is more reliable. A secure word stock in which each character belongs to a single group has higher reliability, while one in which a single character belongs to several groups needs fewer characters on average to extract the security code and is therefore more widely applicable.
The invention also discloses a computer-readable storage medium and an electronic device. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the character grouping method based on binary character frequency described above or the secure word stock construction method described above. The electronic device comprises a memory, a processor, and a computer program stored on the memory; when executing the computer program, the processor implements the character grouping method based on binary character frequency or the secure word stock construction method.
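The selection step of the second embodiment can be sketched in Python under stated assumptions. The formula relating the number of bits x to the number of deformed characters is not available in this text; the sketch assumes that a character with x bits has 2**x forms in total (one standard plus the deformed ones), and that the bits are read in the order in which the groups of the character are listed. Both points, and all names, are hypothetical.

```python
def select_form(security_code, groups_of_char, forms):
    """Choose one form of a character from the code bits of its groups.

    security_code:  string of '0'/'1' characters, one bit per group (length K)
    groups_of_char: the x group indices that the character belongs to
    forms:          forms[0] is the standard form, forms[1:] the deformed forms;
                    assumed length is 2 ** x
    """
    if len(forms) != 2 ** len(groups_of_char):
        raise ValueError("expected 2**x forms for a character in x groups")
    bits = "".join(security_code[k] for k in groups_of_char)
    return forms[int(bits, 2)]  # index 0 selects the standard form


code = "1011"  # K = 4
frequent = select_form(code, (0, 2), ["std", "def1", "def2", "def3"])
rare = select_form(code, (1,), ["std", "def1"])
print(frequent, rare)
```

Under these assumptions a single high-frequency character reveals x bits of the code at once, which matches the stated benefit that fewer characters are needed on average to extract the security code.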
Claims (9)
1. A character grouping method based on binary character frequency is characterized in that: the method comprises the following steps:
traversing a corpus and counting the co-occurrences of every pair of characters among the N characters to be grouped, to obtain a binary character frequency matrix, each element of which represents the frequency with which one character is followed by another;
traversing the characters one by one in descending order of character frequency, and calculating the weight of allocating the character c to be allocated to the k-th group according to the following formula:
[formula not reproduced in this text]
wherein A is the set consisting of the grouped characters and the character c to be allocated, and the two coefficients of the formula are constants greater than 0 [their symbols and a further condition on them are not reproduced in this text];
and adding the character c to be allocated to the group with the largest weight, and the like until all the characters are grouped.
2. The character grouping method based on binary character frequency as claimed in claim 1, wherein: before the step of traversing the characters one by one in descending order of character frequency, the method further comprises: normalizing the binary character frequency matrix to obtain a normalized binary character frequency matrix; the weight of allocating the character c to each group being calculated according to a formula [not reproduced in this text] defined on the normalized matrix.
4. The binary character frequency-based character grouping method as claimed in claim 1, wherein: the step of adding the character c to be assigned to the group with the maximum weight comprises the following steps:
if only one group has the maximum weight, adding the character c to be allocated to that group;
if several groups share the maximum weight, selecting among them the group with the fewest characters;
if only one group has the fewest characters, adding the character c to be allocated to that group;
if several groups have the fewest characters, adding the character c to be allocated to one of them at random.
6. A method for constructing a secure word stock is characterized by comprising the following steps: the method comprises the following steps:
selecting the first N characters in descending order of character frequency and designing a deformation of each of the N characters to obtain deformed characters, the standard character and the deformed character of each character representing 0 and 1 respectively;
dividing the N characters into K groups according to the steps of claim 1, each character belonging to exactly one group, K being the number of bits of the binary string that encodes the security code represented by the secure word stock;
for any security code, selecting the standard character or the deformed character of each character according to the bit corresponding to the group of that character, the selected standard or deformed characters of the N characters together with the standard characters of all other characters forming the secure word stock corresponding to that security code.
7. A method for constructing a secure word stock is characterized by comprising the following steps: the method comprises the following steps:
selecting the first N characters in descending order of character frequency and designing deformations of each of the N characters to obtain deformed characters;
binary-coding the standard character and the deformed characters of each character, the number of bits x of the binary code and the number of deformed characters of the character satisfying a formula [not reproduced in this text];
dividing the N characters into K groups according to the steps of claim 1, the number of groups to which each character belongs being equal to the number of bits x of the binary code of that character, K being the number of bits of the binary string that encodes the security code represented by the secure word stock;
for any security code, using the bits corresponding to the groups of each character as the binary code that selects the standard character or a deformed character of that character, the selected standard or deformed characters of the N characters together with the standard characters of all other characters forming the secure word stock corresponding to that security code.
8. A computer-readable storage medium, characterized by: storing a computer program which, when executed by a processor, implements the character grouping method based on binary character frequency as claimed in any one of claims 1 to 5 or the secure word stock construction method as claimed in claim 6 or 7.
9. An electronic device, characterized by: comprising a memory, a processor, and a computer program stored on the memory, the processor, when executing the computer program, implementing the character grouping method based on binary character frequency as claimed in any one of claims 1 to 5 or the secure word stock construction method as claimed in claim 6 or 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211416943.3A CN115618809A (en) | 2022-11-14 | 2022-11-14 | Character grouping method based on binary character frequency and safe word stock construction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211416943.3A CN115618809A (en) | 2022-11-14 | 2022-11-14 | Character grouping method based on binary character frequency and safe word stock construction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115618809A (en) | 2023-01-17 |
Family
ID=84879139
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211416943.3A (Pending; published as CN115618809A) | Character grouping method based on binary character frequency and safe word stock construction method | 2022-11-14 | 2022-11-14 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115618809A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116821967A (en) * | 2023-08-30 | 2023-09-29 | 山东远联信息科技有限公司 | Intersection computing method and system for privacy protection |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116821967A (en) * | 2023-08-30 | 2023-09-29 | 山东远联信息科技有限公司 | Intersection computing method and system for privacy protection |
CN116821967B (en) * | 2023-08-30 | 2023-11-21 | 山东远联信息科技有限公司 | Intersection computing method and system for privacy protection |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Shirali-Shahreza et al. | A new approach to Persian/Arabic text steganography | |
Roy et al. | A novel approach to format based text steganography | |
Xiang et al. | A novel linguistic steganography based on synonym run-length encoding | |
Gutub et al. | Improved method of Arabic text steganography using the extension ‘Kashida’character | |
CN114708133B (en) | Universal text watermarking method and device | |
CN105303075B (en) | Adaptive Text Watermarking method based on PDF format | |
Gutub et al. | Utilizing diacritic marks for Arabic text steganography | |
CN109785222B (en) | Method for quickly embedding and extracting information of webpage | |
CN112016061A (en) | Excel document data protection method based on robust watermarking technology | |
CN115618809A (en) | Character grouping method based on binary character frequency and safe word stock construction method | |
Ramakrishnan et al. | Text steganography: a novel character‐level embedding algorithm using font attribute | |
CN115689853A (en) | Robust text watermarking method based on Chinese character characteristic modification and grouping | |
CN114356919A (en) | Watermark embedding method, tracing method and device for structured database | |
Alkhafaji et al. | Payload capacity scheme for quran text watermarking based on vowels with kashida | |
CN107220333B (en) | character search method based on Sunday algorithm | |
CN110770725A (en) | Data processing method and device | |
Alanazi et al. | Involving spaces of unicode standard within irreversible Arabic text steganography for practical implementations | |
CN109495275A (en) | Generate the setting method of random verification code | |
Ghilan et al. | Combined Markov model and zero watermarking techniques to enhance content authentication of english text documents | |
WO2024066271A1 (en) | Database watermark embedding method and apparatus, database watermark tracing method and apparatus, and electronic device | |
CN110210224B (en) | Intelligent big data mobile software similarity detection method based on description entropy | |
Rui et al. | A multiple watermarking algorithm for texts mixed Chinese and English | |
CN115455965B (en) | Character grouping method based on word distance word chain, storage medium and electronic equipment | |
Lin et al. | A data hiding scheme on word documents using multiple-base notation system | |
CN114637972A (en) | Watermark embedding and extracting method based on docx format document |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||