CN114708133B - Universal text watermarking method and device - Google Patents

Publication number
CN114708133B
Authority
CN
China
Prior art keywords
watermark
information
file
characters
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210100368.XA
Other languages
Chinese (zh)
Other versions
CN114708133A (en
Inventor
李公宝
丛升日
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Guoyin Technology Co ltd
Original Assignee
Beijing Guoyin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Guoyin Technology Co ltd filed Critical Beijing Guoyin Technology Co ltd
Priority to CN202210100368.XA priority Critical patent/CN114708133B/en
Publication of CN114708133A publication Critical patent/CN114708133A/en
Application granted granted Critical
Publication of CN114708133B publication Critical patent/CN114708133B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/0021 Image watermarking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Editing Of Facsimile Originals (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a universal text watermarking method and a universal text watermarking device. The method comprises the following steps: grouping a certain number of characters in the selected word stock according to a specific strategy; performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file; generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal; dynamically generating and loading a watermark font file in real time according to the watermark encoding data and by combining the watermark character data temporary file and the grouped characters; running a text file in an electronic format, and embedding watermark information in document content data printed out of the file and displayed on a screen in real time by using a watermark font file; and acquiring document picture data with hidden watermark information, extracting the watermark information, and tracing a document divulgence source. The text watermarking method has the advantages of strong universality, good compatibility, high stability and simple watermark information embedding process.

Description

Universal text watermarking method and device
Technical Field
The invention belongs to the technical field of document protection and image processing, relates to a method and a device for embedding and extracting a digital watermark, and particularly relates to a universal text watermarking method and a universal text watermarking device.
Background
With the development of electronic commerce and electronic government affairs, enterprises and public institutions, party administration, national security and other relevant departments will process a large amount of text materials, including contracts, secret-related important documents and the like. The research on copyright protection and content security of these text files is an important issue, and digital watermarking technology provides a way for solving the above problems. In addition, many text documents exist not only in a digital form but also in a paper form by means of printing, copying, and the like. With the rapid development of digitization technology, this approach has become quite common, which makes many important or confidential information easily leaked out by printing out paper documents or displaying electronic document screens as a transmission path. Therefore, it is important to research digital watermarking technology based on text files, which can resist print scanning and screen shooting.
In existing text watermarking technology, text digital watermarking based on modifying the topological structure of characters has become the mainstream approach for improving robustness against attacks such as print-and-scan, screen capture, and screen photography. A given character is deformed in several different ways, and each deformation corresponds to a different watermark information bit string. The character deformation data can be stored in a dedicated watermark font library, and watermark information is embedded through font replacement when an electronic text document is printed or displayed on screen. Real-time loading of watermark information is therefore a key step in text watermarking based on font replacement. Currently, the loading of the word stock and the embedding of the watermark information are mainly realized in the following ways:
1) Via HOOK technology. To replace font information in real time, the content data of the text file must be obtained in advance. Typically, a system hook intercepts a specific file operation, the intermediate-format file data is obtained, the watermark information is embedded by performing the font replacement, and the normal file operation is then executed. For example, hooking the print operation of an electronic text file allows watermark information to be embedded in real time while the document is printed normally; alternatively, the file-open operation is intercepted, the content data in the specific file format is parsed, the font is replaced to embed the watermark information, and the watermarked electronic file is then opened.
2) Via a software plug-in mechanism. To replace the word stock in content of a specific file format and embed watermark information, a corresponding software plug-in module can be developed to parse the file content. For example, based on the VBA (Visual Basic for Applications) macro or VSTO (Visual Studio Tools for Office) technology of Microsoft Office, watermark information can be embedded in common Office formats such as DOC, DOCX, Excel, or PPT. When an electronic file in such a format is opened or printed, the Office plug-in calls interfaces provided by the software system to parse and modify the file content, and the watermark information is embedded by replacing the word stock.
However, the above methods have many problems and drawbacks: 1) Poor generality. For example, HOOK technology handles watermark embedding during file printing well, but the screen display operation of an electronic file is difficult to intercept, so watermark information cannot be embedded in on-screen content in real time. The embedding method based on a software plug-in mechanism is tied to a specific software system, and not all software systems provide secondary development interfaces, so its limitations are greater. 2) The existing methods are complex to implement, and plug-in development for web browsers is particularly difficult. 3) HOOK technology has poor stability, many compatibility problems with system software, and low security. 4) Current watermark embedding algorithms embed the watermark sequentially and redundantly in each page of document data, which requires the watermark information content to be locally complete. Such algorithms therefore perform poorly against attacks such as cropping, crumpling, staining, and tearing. In view of the above problems, the present invention provides a universal text watermarking solution.
Disclosure of Invention
The invention provides a method and a device for embedding and extracting a universal text watermark generated based on a dynamic word stock, which are used for solving the problems of poor watermark loading universality, poor system stability, complex implementation process, low watermark algorithm robustness and the like in the prior art on the premise of not changing any use habit of a user.
The idea of the invention is that first, a certain number of characters in a selected word stock are uniformly grouped according to a specific strategy, and all characters in each group represent the same watermark information bit string; performing deformation design on all characters in each group according to a specific rule, respectively obtaining a plurality of watermark character contour curve data corresponding to each character, and generating a watermark character data temporary file; generating watermark coding data of the user terminal according to a specific rule so as to identify the identity authentication information of the user terminal; according to the watermark coding data, dynamically generating a watermark font file through a watermark character data temporary file, wherein the watermark font file has the same attribute with a same-name font file installed in a system; loading the watermark font file in real time and replacing the same-name font file installed in the system; running a text file in an electronic format, and embedding watermark information in document content data of file printout and screen display in real time; and acquiring document picture data hidden with watermark information, extracting the watermark information, and tracing a document leakage source. Thus, a universal text watermark embedding and extracting method and device are obtained.
The invention discloses a universal text watermark method, which comprises a text watermark embedding and extracting method, wherein the text watermark embedding comprises the following steps:
step one, grouping a certain number of characters in a selected word stock according to a specific strategy;
step two, performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file;
step three, generating watermark encoding data of the user terminal to identify the identity authentication information of the user terminal;
step four, dynamically generating and loading a watermark font file in real time according to the watermark encoding data and combining the watermark character data temporary file and the grouped characters;
and step five, operating the text file in the electronic format, and embedding watermark information in the document content data printed out of the file and displayed on the screen in real time by using the watermark font file.
Further, the method also comprises a text watermark extraction step, namely a step six: and acquiring document picture data hidden with watermark information, extracting the watermark information, and tracing a document leakage source.
Preferably, the method for grouping a certain number of characters in the selected word stock comprises:
firstly, sorting common characters from high to low according to the word frequency statistical result, and taking the first N characters to form a watermark character set [formula image BDA0003492152240000031];
Next, the N characters are initially divided into M groups, denoted $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$, where $M < N$ and, for any [formula image BDA0003492152240000032], $0 < i, j \le M$, $i \ne j$, [formula image BDA0003492152240000033].
The specific grouping process is as follows:
Step 1. The first M characters in character-frequency order, [formula image BDA0003492152240000034], are assigned in sequence to $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$, one character per group;
Step 2. The next M characters, [formula image BDA0003492152240000035], are assigned in reverse order to $\{\Omega_M, \Omega_{M-1}, \ldots, \Omega_1\}$, one character per group;
Step 3. Step 1 and Step 2 are repeated until all N characters have been grouped.
Then, randomly selecting a certain number of text training corpora, and fixing the number of characters of each text training corpus at t;
finally, computing the probability that the characters of each group in $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$ appear in the text training corpora of t characters, and optimizing the grouping result according to the probability distribution to obtain the final grouping $\{\Omega'_1, \Omega'_2, \ldots, \Omega'_M\}$.
Preferably, the method of optimizing the grouping result according to the probability distribution to obtain the final grouping includes:
Step 1. For each group $\Omega_i$, $1 \le i \le M$, calculate the probability that its characters appear in the text training corpora of t characters, and sort all the probabilities from largest to smallest;
Step 2. Take the character with the lowest word frequency out of the group with the highest probability and move it to the group with the lowest probability; take the character with the lowest word frequency out of the group with the second-highest probability and move it to the group with the second-lowest probability; and continue in this way until all moves are finished;
Step 3. Repeat Step 1 and Step 2 until the variance of the probabilities of all groups reaches its minimum, thereby obtaining the final grouping $\{\Omega'_1, \Omega'_2, \ldots, \Omega'_M\}$.
Preferably, the performing of the deformation design on all the characters in each group means that the vector contour curve data of the characters is adjusted to obtain d different deformations, wherein d is greater than or equal to 2, and the different contour curve deformations represent different watermark information bit strings;
preferably, all characters in each group represent the same watermark information bit string.
Preferably, the generating of the temporary file of watermark character data refers to storing the designed and generated character deformation contour curve data in the temporary file together with character attribute description information, where the character attribute description information includes a width of a font frame, a height of the font frame, and an offset of each font in the font frame, where the offset may vary with different font structures.
Preferably, the generating of the user terminal watermark coding data includes the identity authentication information and the time information of the user terminal, and the specific generating method includes a manual designation and an automatic allocation mode.
Preferably, the automatic allocation method includes:
step1, creating a user terminal identity identification information recording table in a system background, wherein the information content comprises user ID, a user login account, a machine MAC address and machine IP address information, the user ID is automatically allocated and added by the background system, and the rest information is automatically submitted for a client monitoring program.
And step2, running a client monitoring program, automatically acquiring identity identification information of the user terminal, uploading the identity identification information to a system background, directly returning the user ID information when the uploaded identification information exists in a record table of a database of the system background, or adding a new record in the database of the system background, adding 1 to the user ID, and returning the user ID to the client.
And step3, after receiving the user ID information returned by the system background, the client monitoring process reads the system operation time in real time, and performs error correction coding processing on the user ID information and the time information to obtain the final user terminal watermark coding data.
Preferably, the error correction coding processing means that, for a watermark information bit string of length Len, where Len is a multiple of 8, parity checking is used to encode the watermark information and obtain the complete watermark coding data. The specific process is as follows: the watermark information bit string is arranged into a matrix of Len/8 rows and 8 columns; the parity bit of each column is calculated to form row Len/8+1, so that the total length of the valid information bits and the check bits is Len+8.
Preferably, the dynamically generating and loading the watermark font file in real time means opening a temporary watermark character data file, dynamically reading font outline curve data from the temporary watermark character data file according to a grouping strategy and user terminal watermark encoding data, and updating corresponding font structure data in a standard font installed in the system.
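The selection of glyph outlines described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes d = 4 deformations per character (so each group carries 2 bits), and the names `bits_to_symbols`, `select_glyphs`, and the in-memory `variants` dictionary standing in for the temporary watermark character data file are all hypothetical.

```python
def bits_to_symbols(code_bits, bits_per_group=2):
    """Split a bit string such as '1101' into one integer symbol per group."""
    if len(code_bits) % bits_per_group != 0:
        raise ValueError("code length must be a multiple of bits_per_group")
    return [int(code_bits[i:i + bits_per_group], 2)
            for i in range(0, len(code_bits), bits_per_group)]


def select_glyphs(groups, code_bits, variants, bits_per_group=2):
    """Pick, for every grouped character, the outline variant that encodes
    the symbol assigned to its group.

    groups   : list of character lists (one list per group)
    variants : dict mapping character -> list of d outline blobs (assumed layout)
    """
    symbols = bits_to_symbols(code_bits, bits_per_group)
    if len(symbols) != len(groups):
        raise ValueError("need exactly one symbol per group")
    chosen = {}
    for group, symbol in zip(groups, symbols):
        for ch in group:
            # Every character of a group carries the same bit string.
            chosen[ch] = variants[ch][symbol]
    return chosen


# Toy data: two groups, four placeholder "outlines" per character.
groups = [["a", "b"], ["c"]]
variants = {ch: [f"{ch}{k}" for k in range(4)] for ch in "abc"}
chosen = select_glyphs(groups, "1101", variants)
```

The resulting mapping is what would then be written over the corresponding glyph entries of the standard font; that font-writing step is omitted here because it depends on the font format.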
Preferably, the dynamically generating and loading the watermark word stock file in real time means that the loading of the watermark word stock is respectively completed according to the difference between Windows and Linux operating systems:
1) Windows environment.
Firstly, calling the system API function RemoveFontResource(LPCTSTR lpFileName) to remove the standard font library installed by the system from the system font table; sending a WM_FONTCHANGE message to all top-level windows in the system to notify them of the change; and then calling the AddFontResource(LPCTSTR lpszFilename) function to add the dynamically generated watermark font file resource to the system font table for normal use by Windows application programs.
2) A Linux environment.
The standard fonts installed by the system are uninstalled by deleting the font files from the corresponding folders. After a global font is deleted, the fc-cache -fv command is run to update the system font cache. The dynamically generated watermark font library file is then copied into the corresponding font directory, and fc-cache -fv is run again so that the system registers the change. When fc-cache completes, all users on the system can access the newly added global font.
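The two loading paths can be sketched in Python as shown below. This is a hedged illustration only: the function names, the `refresh` flag, and the file paths are assumptions, the Windows branch relies on `ctypes.windll` and therefore runs only on Windows, and the Linux branch skips the cache refresh when `fc-cache` is not installed.

```python
import shutil
import subprocess
from pathlib import Path

WM_FONTCHANGE = 0x001D   # Win32 message announcing a font-table change
HWND_BROADCAST = 0xFFFF  # pseudo-handle addressing all top-level windows


def swap_font_windows(standard_font: str, watermark_font: str) -> bool:
    """Windows: remove the standard font, broadcast the change, add the new font."""
    import ctypes  # ctypes.windll exists only on Windows

    gdi32, user32 = ctypes.windll.gdi32, ctypes.windll.user32
    gdi32.RemoveFontResourceW(standard_font)
    user32.SendMessageW(HWND_BROADCAST, WM_FONTCHANGE, 0, 0)
    # AddFontResourceW returns the number of fonts added (0 on failure).
    return gdi32.AddFontResourceW(watermark_font) > 0


def refresh_font_cache() -> None:
    """Run 'fc-cache -fv' if fontconfig is available on this machine."""
    if shutil.which("fc-cache"):
        subprocess.run(["fc-cache", "-fv"], check=True)


def swap_font_linux(font_dir, font_name, watermark_font, refresh=True) -> Path:
    """Linux: delete the installed font file, then copy the watermark font
    into the same directory under the same name, refreshing the cache each time."""
    target = Path(font_dir) / font_name
    target.unlink(missing_ok=True)  # requires Python 3.8+
    if refresh:
        refresh_font_cache()
    shutil.copyfile(watermark_font, target)
    if refresh:
        refresh_font_cache()
    return target
```

Replacing files in a system-wide font directory normally requires administrator rights, so a deployment would likely run this from a privileged client service.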
Preferably, the extracting of the watermark information mainly includes three steps:
1) According to the distribution of character groups in the document content, respectively extracting watermark information bit strings represented by all characters in each group;
2) Checking and correcting the preliminarily extracted watermark information bit string by using a watermark information bit string parity check rule;
3) Finally, splicing the watermark information bit strings extracted from all the groups to obtain the final complete watermark information bit string.
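The three extraction steps can be sketched as follows. Several points are assumptions rather than statements of the patent: the glyph recognizer that yields `(group_index, symbol)` observations is hypothetical, symbols are combined by majority vote within a group, parity is taken to be even, and a group that never appears in the captured text is treated as an erasure that the column parity can fill in when it is the only missing bit in its column.

```python
from collections import Counter


def vote_symbols(observations, num_groups):
    """Majority-vote the symbol of each group; None if the group was never seen."""
    ballots = [Counter() for _ in range(num_groups)]
    for group_index, symbol in observations:
        ballots[group_index][symbol] += 1
    return [b.most_common(1)[0][0] if b else None for b in ballots]


def symbols_to_bits(symbols, bits_per_group=2):
    """Expand symbols into bits (most significant first); erasures stay None."""
    bits = []
    for s in symbols:
        if s is None:
            bits.extend([None] * bits_per_group)
        else:
            bits.extend((s >> k) & 1 for k in reversed(range(bits_per_group)))
    return bits


def check_and_repair(bits, cols=8):
    """Verify even column parity; fill a column's single erased bit if possible.
    Returns (payload_bits_without_parity_row, all_columns_ok)."""
    rows = [bits[i:i + cols] for i in range(0, len(bits), cols)]
    ok = True
    for c in range(cols):
        column = [row[c] for row in rows]
        missing = [r for r, b in enumerate(column) if b is None]
        if len(missing) == 1:
            rows[missing[0]][c] = sum(b for b in column if b is not None) % 2
        elif missing or sum(column) % 2 != 0:
            ok = False
    payload = [b for row in rows[:-1] for b in row]
    return payload, ok
```

Note that a single parity row can only detect, not locate, a flipped bit; recovery in this sketch is limited to positions known to be missing.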
Based on the same inventive concept, the invention also provides a universal text watermarking device, which comprises:
a character grouping module: the system is responsible for grouping a certain number of characters in the selected word stock according to a specific strategy;
a font design module: the system is responsible for performing deformation design on all characters in each group processed by the character grouping module according to a specific rule and generating a watermark character data temporary file;
watermark coding generation module: the system is responsible for generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal;
the word stock generation and loading module: the system is responsible for dynamically generating and loading the watermark font file in real time according to the watermark encoding data generated by the watermark coding generation module, in combination with the temporary watermark character data file generated by the font design module and the grouped characters obtained by the character grouping module;
a watermark embedding module: the system is responsible for running a text file in an electronic format and for using the watermark word stock file generated and loaded by the word stock generation and loading module to embed watermark information in real time in the document content data that is printed out or displayed on screen;
the watermark extraction module: and the system is responsible for acquiring the document picture data which is obtained after the processing of the watermark embedding module and is hidden with the watermark information, extracting the watermark information and further tracing the document leakage source.
The invention has the following beneficial effects:
according to the invention, according to the unique watermark coding data information of the user terminal, the personalized watermark word stock is dynamically generated at the client, and the unique ID identification information is embedded in each watermark word stock. When the user identity information of different clients is different, the watermark information embedded in the watermark font library is also different. After the watermark font library is loaded in real time, all application software systems calling the local font library to carry out printing output and screen display embed watermark information in the file content in real time. Therefore, the text watermarking method has the advantages of strong universality, good compatibility, high stability and simple watermark information embedding process. In addition, due to the adoption of a packet unordered embedding strategy, the text watermarking method has higher robustness for resisting malicious attacks such as cutting, kneading, fouling, tearing and the like.
Drawings
Fig. 1 is a schematic flow chart of an implementation of a general text watermark embedding and extracting method described in the embodiment;
FIG. 2 is a schematic diagram of an arcuate grouping method;
FIG. 3 is a diagram of a temporary file structure of watermark character data;
fig. 4 is a schematic diagram of error correction encoding of watermark information in the method according to the embodiment;
FIG. 5 is a diagram illustrating a process of dynamically generating a watermark font library;
fig. 6 is a schematic structural diagram of a device for embedding and extracting a general text watermark in an embodiment.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.
Fig. 1 is a schematic flow chart of an implementation of a general text watermark embedding and extracting method in the embodiment.
S101, grouping a certain number of characters in the selected word stock according to a specific strategy.
In the invention, in order to realize universal watermark information embedding, a unique personalized watermark word stock is dynamically generated at each client, namely, the watermark word stocks generated by different clients are different (which is different from any existing technical scheme, namely, the watermark word stocks generated and installed by each client in the existing technical scheme are the same), and corresponding user terminal identity authentication information is embedded in each watermark word stock. After the watermark word stock is generated, the watermark word stock is dynamically loaded in real time and replaces a standard word stock which is pre-installed in an operating system, and when all application software systems calling watermark fonts of a client side perform screen display and printout operations of an electronic document, watermark information is automatically embedded into the document content in real time. Therefore, the watermarking method has strong universality, simple implementation process and good compatibility with a system and other application software. But in order to ensure sufficient information capacity and watermark extraction efficiency, we represent a specific watermark bit string by a group of characters, i.e. all characters in the group represent the same watermark information bit string. When at least one character of each packet is present in the text content, the corresponding bit string of watermark information can be correctly extracted. Therefore, a certain number of characters in the selected word stock are reasonably grouped according to a specific strategy, so that the probability that the characters in each group appear in the text content is relatively high.
The specific grouping method comprises the following steps:
firstly, sorting common characters from high to low according to the word frequency statistical result, and taking the top N characters to form a watermark character set [formula image BDA0003492152240000061].
In the present embodiment, N = 2000.
Next, the N characters are initially divided into M groups $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$, where $M < N$ and, for any [formula image BDA0003492152240000062], [formula image BDA0003492152240000063], $0 < i, j \le M$, $i \ne j$, [formula image BDA0003492152240000064].
In the present embodiment, M =28.
The specific grouping process is as follows:
Step 1. The first M characters in character-frequency order, [formula image BDA0003492152240000065], are assigned in sequence to $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$, one character per group.
Step 2. The next M characters, [formula image BDA0003492152240000066], are assigned in reverse order to $\{\Omega_M, \Omega_{M-1}, \ldots, \Omega_1\}$, one character per group.
Step 3. Step 1 and Step 2 are repeated until all N characters have been grouped.
The M groups $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$ are preliminarily obtained by the arch grouping method shown in FIG. 2. For example, the character set in the first group is:
[formula image BDA0003492152240000071]
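A minimal Python sketch of the arch grouping in Steps 1 to 3 follows. The function name `arch_group` and the toy alphabet are illustrative assumptions; in the embodiment the input would be the N = 2000 frequency-sorted characters and M = 28.

```python
def arch_group(chars, m):
    """Distribute frequency-sorted characters into m groups, alternating
    forward passes (group 1..M) and reverse passes (group M..1)."""
    groups = [[] for _ in range(m)]
    for idx, ch in enumerate(chars):
        round_no, pos = divmod(idx, m)
        # Even rounds run forward, odd rounds run backward.
        target = pos if round_no % 2 == 0 else m - 1 - pos
        groups[target].append(ch)
    return groups


# Toy stand-in for a frequency-sorted character list (most frequent first).
chars = [chr(ord("a") + i) for i in range(10)]
groups = arch_group(chars, 3)
# groups == [['a', 'f', 'g'], ['b', 'e', 'h'], ['c', 'd', 'i', 'j']]
```

Alternating the direction pairs the most frequent character of one pass with the least frequent of the next, which evens out the total frequency mass per group before any further optimization.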
Then, a certain number of text training corpora are randomly selected, and the number of characters in each is fixed at t; in this embodiment, t = 200. To verify the probability that the characters of each group appear in common text documents, a large number of samples must be collected for training and testing. To this end, nearly 50 million articles were downloaded by means of internet crawlers, covering the fields of politics, military affairs, news, sports, culture, history, finance, and the like. After content filtering and clipping, each collected article is stored as a text training corpus of 200 characters.
Finally, the probability that the characters of each group in $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$ appear in the text training corpora of t characters is computed, and the grouping result is optimized according to the probability distribution to obtain the final grouping $\{\Omega'_1, \Omega'_2, \ldots, \Omega'_M\}$.
Because the grouping operation is performed only based on the word frequency sorting result in the method, the situation of uneven probability distribution may occur in the actual text corpus training process, and therefore specific optimization operation is required to obtain more balanced grouping. The specific grouping optimization method comprises the following steps:
Step 1. For each group $\Omega_i$, $1 \le i \le M$, calculate the probability that its characters appear in the training corpora, and sort all the probabilities from largest to smallest;
Step 2. Take the character with the lowest word frequency out of the group with the highest probability and move it to the group with the lowest probability; take the character with the lowest word frequency out of the group with the second-highest probability and move it to the group with the second-lowest probability; and continue in this way until all moves are finished;
Step 3. Repeat Step 1 and Step 2 until the variance of the probabilities of all groups reaches its minimum, thereby obtaining the optimal grouping $\{\Omega'_1, \Omega'_2, \ldots, \Omega'_M\}$.
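The optimization loop can be sketched as below, under stated assumptions. The probability of a group is interpreted here as the fraction of training corpora that contain at least one of its characters, the loop stops as soon as a round fails to reduce the variance (a local rather than proven global minimum), and `freq_rank` (0 for the most frequent character) is a hypothetical input.

```python
from statistics import pvariance


def group_probs(groups, corpora):
    """Fraction of corpora in which at least one character of each group occurs."""
    corpus_sets = [set(text) for text in corpora]
    return [sum(1 for s in corpus_sets if s & set(g)) / len(corpus_sets)
            for g in groups]


def rebalance(groups, corpora, freq_rank, max_rounds=100):
    """Move the rarest character of high-probability groups to low-probability
    groups, round by round, while the variance of the probabilities decreases."""
    best = [list(g) for g in groups]
    best_var = pvariance(group_probs(best, corpora))
    current = [list(g) for g in best]
    for _ in range(max_rounds):
        probs = group_probs(current, corpora)
        order = sorted(range(len(current)), key=lambda i: probs[i], reverse=True)
        # Pair highest with lowest, second-highest with second-lowest, ...
        for k in range(len(order) // 2):
            src, dst = current[order[k]], current[order[-1 - k]]
            if len(src) > 1:  # never empty a group
                rarest = max(src, key=lambda c: freq_rank[c])
                src.remove(rarest)
                dst.append(rarest)
        var = pvariance(group_probs(current, corpora))
        if var >= best_var:
            break  # no further improvement; keep the best grouping so far
        best, best_var = [list(g) for g in current], var
    return best


corpora = ["ab", "ac", "bc", "ax"]
freq_rank = {"a": 0, "b": 1, "c": 2, "x": 3}
balanced = rebalance([["a", "b", "c"], ["x"]], corpora, freq_rank)
```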
S102, performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file.
The character deformation design characteristic refers to that vector outline curve data of a character is adjusted to obtain d different deformations, wherein d is larger than or equal to 2, and the different outline curve deformations represent different watermark information bit strings. In order to effectively increase the watermark information capacity, the number of character variations usually exceeds 2, and in the present embodiment, d =4. I.e. 4 different variants per character, represents a 2-bit watermark information bit string. In addition, it should be noted that all characters in each packet represent the same watermark information bit string.
And storing the character deformation contour curve data generated by the design in a temporary file together with character attribute description information, wherein the character attribute description information comprises the width of a font frame, the height of the font frame and the offset of each font in the font frame, and the offset can be changed along with the difference of font structures. In the temporary file, a storage structure of each font data is as shown in fig. 3. The specific information is described as follows:
UNICODE encoding of characters: assigning a unique UNICODE code to the character represented by the glyph structure in the temporary file;
horizontal layout: the width of the font outer frame and the distance from the leftmost point of the font contour line to the font left frame are included;
vertical layout: the height of the font outer frame and the distance from the topmost point of the font contour line to the font upper frame are included;
size of primitive data: the capacity of the vector outline curve data of the character pattern structure is represented, and the unit is byte;
primitive data: an array of BYTE types stores vector outline curve data of a specific glyph structure, and also includes the definition of a grid and associated instruction data.
S103, generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal.
In order to effectively track and trace a client-side divulgence event, unique watermark coding data information needs to be generated for a user terminal, the content includes identity authentication information and time information of the user terminal, and a specific generation method includes a manual designation mode and an automatic distribution mode. The manual designation mode refers to that when the client system is installed and deployed, unique encoding information is manually designated for the client. The method for automatically distributing the watermark coding data characteristics comprises the following steps:
Step1, creating a user terminal identity information record table in the system background. Each record contains the user ID, the user login account, the machine MAC address and the machine IP address; the user ID is automatically allocated and incremented by the background system, and the remaining information is submitted automatically by the client monitoring program;
Step2, running the client monitoring program, which automatically acquires the identity information of the user terminal and uploads it to the system background. If the uploaded information already exists in the record table of the background database, the corresponding user ID is returned directly; otherwise a new record is added to the background database, the user ID is incremented by 1, and the new user ID is returned to the client;
Step3, after receiving the user ID returned by the system background, the client monitoring process reads the system running time in real time and performs error correction coding on the user ID and the time information to obtain the final user terminal watermark coding data.
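A minimal sketch of the automatic allocation follows, using an in-memory dictionary in place of the background database. The class and function names are hypothetical, and the split of the payload into a 32-bit user ID and a 16-bit time value is an assumption chosen only so that the total matches the 48-bit example used in this embodiment; the source does not state the field widths.

```python
class TerminalRegistry:
    """In-memory stand-in for the background record table (hypothetical)."""

    def __init__(self):
        self._ids = {}      # (account, mac, ip) -> user ID
        self._last_id = 0   # assumption: the first allocated ID is 1

    def get_or_assign(self, account, mac, ip):
        """Return the existing user ID, or allocate the next one."""
        key = (account, mac.lower(), ip)
        if key not in self._ids:
            self._last_id += 1
            self._ids[key] = self._last_id
        return self._ids[key]


def build_payload(user_id, time_value, id_bits=32, time_bits=16):
    """Concatenate user ID and time value into a list of bits, MSB first."""
    if user_id >= 1 << id_bits or time_value >= 1 << time_bits:
        raise ValueError("value does not fit in the assumed field width")
    word = (user_id << time_bits) | time_value
    total = id_bits + time_bits
    return [(word >> (total - 1 - k)) & 1 for k in range(total)]
```

The resulting bit list is the input to the error correction coding step.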
For a watermark information bit string of length Len, where Len is a multiple of 8, parity checking is used as the error correction coding to obtain the complete watermark coding data. The process is as follows: arrange the watermark information bits into a matrix of Len/8 rows and 8 columns; compute the parity bit of each column to form row Len/8+1, so that the total length of the valid information code plus the check code is Len+8. In this embodiment Len = 48; the watermark information bit string is arranged as shown in fig. 4, where the first 6 rows are the valid watermark information bits and each entry of the last row (the gray part in the figure) is the parity bit of the first 6 rows of its column, so the final watermark coding data is 56 bits long. Since 28 groups are selected in this embodiment and each group represents 2 bits of watermark information, exactly 56 bits of watermark coding data can be embedded.
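The column parity coding can be sketched as follows. Even parity is assumed here because the text does not say whether even or odd parity is used, and the function names are illustrative.

```python
def add_column_parity(bits):
    """Append one parity row to a bit list arranged as rows of 8 columns."""
    if len(bits) % 8 != 0:
        raise ValueError("Len must be a multiple of 8")
    rows = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    # Assumption: even parity, so each column including its parity bit sums to 0 mod 2.
    parity = [sum(column) % 2 for column in zip(*rows)]
    return bits + parity


def failing_columns(coded):
    """Return the indices of columns whose parity check fails."""
    rows = [coded[i:i + 8] for i in range(0, len(coded), 8)]
    return [c for c, column in enumerate(zip(*rows)) if sum(column) % 2]
```

With Len = 48 the coded string has 56 bits, matching the 28 groups of 2 bits in this embodiment. A single parity row on its own identifies the column in which an odd number of bits was flipped but not the row, so a correction step needs further information, for example repeated extraction of the same group.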
And S104, dynamically generating and loading the watermark font file in real time according to the watermark coding data.
Dynamically generating the watermark font library means opening the temporary file of watermark character data, reading glyph outline curve data from it according to the grouping strategy and the watermark coding information, and updating the corresponding glyph structure data in the standard font library installed in the system. The specific process is shown in fig. 5.
Step1, firstly, analyzing key data of a system installation standard word stock file, wherein the key data comprises a font information header, a maximum requirement table, a name table, a font coding mapping table, a primitive position index table, a font horizontal layout degree table, a font vertical layout degree table, primitive data and the like;
step2, reading a temporary file containing watermark character contour curve data;
step3, initializing an empty font coding mapping table M, a primitive position index table S, a horizontal layout degree table H and a vertical layout degree table V, and generating an empty primitive data file tmp for storing the primitive data information of the vector;
Step4, writing the data of each character in sequence. For the i-th character currently being processed, let dwS_{i-1} be the total amount of primitive data of the i-1 characters processed before it; the (i-1)-th entry of the primitive position index table is then S_{i-1} = dwS_{i-1}. The i-th character is processed as follows:
1) Updating the font code mapping table M according to the UNICODE code that the character represents in the font library;
2) Reading from the temporary file the horizontal and vertical layout information of the character, the data size dw_N of the new primitive and the vector outline data of the new primitive, and updating the corresponding tables of the target watermark font library;
3) Updating the primitive position index table of the target watermark font file as S_i = S_{i-1} + dw_N;
4) Saving the primitive data read from the temporary file into the primitive data file tmp.
Step5, writing the font file header information and the related attribute values, the font code mapping table, the primitive position index table and the horizontal and vertical layout degree tables in sequence according to the structure of the font file, and finally writing all the primitive data stored in the primitive data file tmp into the primitive data area of the newly generated font library to generate a new watermark font library file.
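The running-offset rule S_i = S_{i-1} + dw_N used in Step4 can be sketched as below. The function name and the input format (pairs of UNICODE code and outline bytes) are assumptions; layout tables, alignment padding and the header fields of a real font file are left out.

```python
def build_glyph_tables(records):
    """Build the code mapping, the position index table and the glyph data area.

    records: iterable of (codepoint, outline_bytes) in the order they are written.
    """
    code_map = {}        # table M: UNICODE code -> glyph index
    offsets = [0]        # table S: S_0 = 0, S_i = S_{i-1} + dw_N
    data = bytearray()   # plays the role of the primitive data file tmp
    for index, (codepoint, outline) in enumerate(records):
        code_map[codepoint] = index
        data += outline
        offsets.append(offsets[-1] + len(outline))
    return code_map, offsets, bytes(data)
```

Glyph i then occupies data[offsets[i]:offsets[i + 1]], so replacing one outline with a deformed variant of a different size only shifts the offsets that follow it.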
As described above, after the watermark font library is generated according to the watermark encoding information, the font resource table of the operating system needs to be updated to complete the correct call of the application software system to the watermark font library. According to different operating systems, the loading process of the watermark word stock is divided into the following two cases:
1) A Windows environment.
First, the system API function RemoveFontResource(LPCTSTR lpFileName) is called to remove the standard font library installed by the system from the system font table, and a WM_FONTCHANGE message is sent to all top-level windows in the system to announce the change. Then the AddFontResource(LPCTSTR lpszFilename) function is called to add the dynamically generated watermark font library file to the system font table for normal use by Windows applications.
2) A Linux environment.
The standard fonts installed by the system are uninstalled by deleting the font files from the corresponding folders. After a global font is deleted, the command fc-cache -fv is run to update the system font cache. The dynamically generated watermark font library file is then copied into the corresponding font directory and fc-cache -fv is run again so that the system registers the change. Once fc-cache completes, all users on the system can access the newly added global font.
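For the Linux case, the copy-and-refresh step might look like the sketch below. The function name and the target directory are assumptions; writing to a system-wide font directory normally requires administrator rights, and the cache refresh is skipped when fc-cache is not on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def install_font(font_file, font_dir, refresh=True):
    """Copy a font file into font_dir and optionally rebuild the font cache."""
    target_dir = Path(font_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(font_file).name
    shutil.copy2(font_file, target)
    if refresh and shutil.which("fc-cache"):
        # -f forces regeneration of the cache, -v prints progress.
        subprocess.run(["fc-cache", "-f", "-v"], check=True, capture_output=True)
    return target
```

Removing the standard font would follow the same pattern: delete the file, then run fc-cache again.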
And S105, running the text file in the electronic format, and embedding watermark information in the document content data of the file printout and the screen display in real time.
As described above, in the present invention, the watermark font file is dynamically generated and updated to the font table of the operating system to replace the standard font file installed by the operating system, so that the application program of the system automatically calls the newly loaded watermark font file, thereby completing the real-time embedding of the watermark information.
S106, obtaining the document picture data with the hidden watermark information, extracting the watermark information, and tracing the document divulgence source.
The watermark information extraction process mainly comprises three steps:
1) According to the distribution of character groups in the document content, respectively extracting watermark information bit strings represented by all characters in each group;
2) Checking and correcting the preliminarily extracted watermark information bit string by using a watermark information bit string parity check rule;
3) And finally, splicing all the information strings extracted by the groups to obtain a final complete watermark information bit string.
It should be noted that when the same character appears at different positions in the document content, the represented watermark information bit string is the same, and the watermark information represented by all characters that appear in the document content and belong to the same group is also the same. In the extraction of watermark information, in order to consider the efficiency of processing, a "multiple" threshold p for watermark character extraction in each packet is defined, that is, watermark information is extracted at most p times in each packet. When the number of occurrences of the same character in the packet exceeds p, or the number of characters contained exceeds p, the watermark extraction process is only run p times. Otherwise, the watermark extraction operation will be performed for all the appearing characters.
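The threshold p can be illustrated with the sketch below. It assumes that a separate recognizer has already turned each watermarked character into a pair of group index and extracted bit string, and it resolves disagreement inside a group by majority vote; the vote and both function names are assumptions, since the text only states that at most p extractions are performed per group.

```python
from collections import Counter, defaultdict


def extract_group_symbols(observations, p):
    """Keep at most p extractions per group and return the most common result.

    observations: iterable of (group_index, bit_string) in document order.
    """
    samples = defaultdict(list)
    for group, symbol in observations:
        if len(samples[group]) < p:   # extraction runs at most p times per group
            samples[group].append(symbol)
    return {g: Counter(s).most_common(1)[0][0] for g, s in samples.items()}


def splice(symbols):
    """Concatenate the per-group bit strings in group order."""
    return "".join(symbols[g] for g in sorted(symbols))
```

Capping the work at p samples per group bounds the extraction cost for long documents, at the price of ignoring later occurrences that might have been read more reliably.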
As shown in fig. 6, based on the same inventive concept, the present invention further provides a general text watermark embedding and extracting apparatus, including:
character grouping module 1: the system is responsible for grouping a certain number of characters in the selected word stock according to a specific strategy;
font design module 2: the system is responsible for performing deformation design on all characters in each group processed by the character grouping module according to a specific rule and generating a watermark character data temporary file;
watermark coding generation module 3: the system is responsible for generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal;
the word stock generation and loading module 4: according to the watermark encoding data generated by the watermark encoding generation module, combining the watermark character data temporary file generated by the font design module and the grouping character obtained by the character grouping module processing, the watermark character library file is responsible for dynamically generating and loading the watermark character library file in real time;
the watermark embedding module 5: the system is in charge of running a text file in an electronic format, and of embedding watermark information in real time, using the watermark font file generated and loaded by the font library generation and loading module, when the file is printed out or its content is displayed on a screen;
the watermark extraction module 6: and the system is responsible for acquiring the document picture data which is obtained after the processing of the watermark embedding module and is hidden with the watermark information, extracting the watermark information and further tracing the document leakage source.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
For example, to prevent the dynamically generated watermark font library from being maliciously tampered with, or from being damaged when a user reinstalls a related font, a watermark font library detection service is deployed and installed on the client. The detection service calculates the SHA1 value of the newly generated watermark font library file and records it in a system ledger. It then scans the loaded watermark font library file at regular intervals, calculates the SHA1 value of the current file, and compares it with the value recorded in the system ledger. If the two values differ, the watermark font library file has been damaged, and the watermark font library generation and loading process is executed again.
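The integrity check performed by the detection service can be sketched with the standard hashlib module. The function names are illustrative, and the ledger is reduced here to a single recorded digest string.

```python
import hashlib
from pathlib import Path


def sha1_of(path):
    """Return the hexadecimal SHA1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def font_is_intact(path, recorded_digest):
    """True if the file exists and still matches the digest in the ledger."""
    return Path(path).is_file() and sha1_of(path) == recorded_digest
```

A periodic task would call font_is_intact and trigger regeneration and reloading of the watermark font library whenever it returns False.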
For example, the generated watermark encoding data of the user terminal includes the identity authentication information and the time information of the user terminal. In order to accurately record different time information every day, a watermark font library generation time information monitoring program can be deployed and installed on the client side. When the operating system is restarted, the monitoring program executes the dynamic generation and real-time loading work of the watermark font library and records the effective time of the current watermark font library. And during the normal operation of the operating system, the monitoring program regularly detects the effective time of the watermark font library. And if the current time is not in the same day as the effective time, the monitoring program re-executes the dynamic generation and real-time loading work, and updates the effective time of the watermark font library again.
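The daily refresh rule can be sketched as a small monitor object. The class name and the regenerate callback are hypothetical; the callback stands for the dynamic generation and real-time loading work described above.

```python
from datetime import datetime


class FontLibraryMonitor:
    """Regenerate the watermark font library when the calendar day changes."""

    def __init__(self, regenerate):
        self._regenerate = regenerate   # hypothetical callback: generate and load
        self.effective_time = None      # None until the first generation

    def check(self, now):
        """Run on restart and periodically; return True if regeneration ran."""
        if self.effective_time is None or self.effective_time.date() != now.date():
            self._regenerate(now)
            self.effective_time = now
            return True
        return False
```

Calling check once at operating system start-up and then on a timer covers both situations described in the paragraph above.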
For example, in order to improve the dynamic generation efficiency of the watermark word stock of the user terminal, a watermark word stock generation time information monitoring program is deployed and installed at the client, and the corresponding watermark word stock is generated in advance according to different time periods. When the operating system is restarted, the monitoring program selects the corresponding watermark font library file according to the current time period of the system, executes the copy and real-time loading work of the watermark font library file, and then deletes the used overdue watermark font library file.
For example, during character grouping optimization the grouping strategy can be further refined by word splitting. Suppose two characters that together form a high-frequency word (for example the word meaning "aim") both fall in the first group; in normal text the probability that the two characters appear together is relatively high. The character of the pair with the lower character frequency can therefore be moved, in the current optimization stage, to a group with lower probability, yielding a grouping whose probabilities are more evenly balanced.
For example, in the watermark font library generation method of embodiment 1, only one of the contour curve deformations of each character is written to the standard character encoding area, so that system application software works normally without problems such as garbled display. If instead the contour curve data of all deformations of each character are copied to the extended encoding area of the font library, and each deformation is assigned its own unique UNICODE code, then more watermark information can be embedded in text with fewer characters by dynamically replacing character codes during document printout or outbound document management, thereby increasing the watermark capacity.

Claims (9)

1. A general text watermarking method, comprising the steps of:
grouping a certain number of characters in the selected word stock according to a specific strategy;
performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file;
generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal;
dynamically generating and loading a watermark font file in real time according to the watermark encoding data and by combining the watermark character data temporary file and the grouped characters;
running a text file in an electronic format, and embedding watermark information in document content data printed out of the file and displayed on a screen in real time by using a watermark font file;
the grouping of a certain number of characters in the selected word stock comprises:
firstly, sorting common characters from high to low according to word frequency statistics, and taking the first N characters to form a watermark character set [formula image FDA0003849491710000011];
next, preliminarily dividing the N characters into M groups, denoted {Ω_1, …, Ω_M}, where M < N, such that for any 0 < i, j ≤ M with i ≠ j the conditions given in [formula images FDA0003849491710000012 and FDA0003849491710000013] hold;
the specific grouping process comprises the following steps:
Step1, selecting the first M characters in character-frequency order [formula image FDA0003849491710000014] and distributing them in sequence into {Ω_1, …, Ω_M}, adding one character to each group;
Step2, selecting the next M characters [formula image FDA0003849491710000015] and adding them in reverse order {Ω_M, Ω_{M-1}, …, Ω_1}, one character to each group in turn;
Step3, repeating Step1 and Step2 until all N characters have been grouped;
then, randomly selecting a certain number of text training corpora, with the number of characters of each text training corpus fixed at t;
finally, counting the probability that the characters of each group in {Ω_1, …, Ω_M} appear in the text training corpora, and optimizing the grouping result according to the probability distribution to obtain the final grouping {Ω′_1, Ω′_2, …, Ω′_M}.
2. The method of claim 1, further comprising the steps of:
and acquiring document picture data hidden with watermark information, extracting the watermark information, and tracing a document leakage source.
3. The method according to claim 1 or 2, wherein the optimizing the grouping result according to the probability distribution to obtain the final grouping comprises:
Step1, calculating, for each group Ω_i (1 ≤ i ≤ M), the probability that its characters appear in the t text training corpora, and sorting all the probabilities from large to small;
Step2, extracting the character with the lowest word frequency from the group with the highest probability and moving it to the group with the lowest probability, and repeating this process in sequence until all moves are finished;
Step3, repeating Step1 and Step2 until the variance of the group probabilities reaches its minimum, thereby obtaining the final grouping {Ω′_1, Ω′_2, …, Ω′_M}.
4. The method according to claim 1 or 2, wherein the performing deformation design on all the characters in each group means that vector contour curve data of the characters are adjusted to obtain d different deformations, wherein d is larger than or equal to 2, and the different contour curve deformations represent different watermark information bit strings; all characters in each packet represent the same watermark information bit string; the generation of the temporary file of the watermark character data refers to the fact that character deformation contour curve data generated by design and character attribute description information are stored in the temporary file together, wherein the character attribute description information comprises the width of a font frame, the height of the font frame and the offset of each font in the font frame, and the offset can be changed along with the difference of font structures.
5. The method according to claim 1 or 2, characterized in that the user terminal watermark encoding data comprises the identity authentication information and the time information of the user terminal, and the user terminal watermark encoding data is generated by adopting a manual designation or automatic distribution mode; the automatic allocation mode comprises the following steps:
step1, creating a user terminal identity identification information recording table in a system background, wherein the information content comprises user ID, a user login account, a machine MAC address and machine IP address information, the user ID is automatically allocated and added by the background system, and the rest information is automatically submitted for a client monitoring program;
step2, operating a client monitoring program, automatically acquiring identity identification information of the user terminal and uploading the identity identification information to a system background, and directly returning user ID information when the uploaded identification information exists in a record table of a system background database, or else, adding a new record in the system background database, adding 1 to the user ID and returning the user ID to the client;
and step3, after receiving the user ID information returned by the system background, the client monitoring process reads the system operation time in real time, and performs error correction coding processing on the user ID information and the time information to obtain the final user terminal watermark coding data.
6. The method according to claim 5, wherein the error correction coding processing refers to performing error correction coding on the watermark information by parity checking to obtain complete watermark coding data for a watermark information bit string of length Len, Len being a multiple of 8, and the specific process is as follows: arranging the watermark information bit string into a matrix of Len/8 rows and 8 columns; calculating the parity check code of each column to form row Len/8+1, the total length of the valid information code and the check code being Len+8.
7. The method according to claim 1 or 2, wherein the dynamically generating and loading the watermark font file in real time means opening a temporary watermark character data file, dynamically reading font outline curve data from the temporary watermark character data file according to a grouping strategy and user terminal watermark encoding data, and updating corresponding font structure data in a standard font installed in the system; the real-time loading of the watermark word stock file refers to the fact that the watermark word stock is loaded respectively according to different Windows and Linux operating systems.
8. The method of claim 2, wherein the extracting watermark information comprises:
according to the distribution of character groups in the document content, respectively extracting watermark information bit strings represented by all characters in each group;
checking and correcting the preliminarily extracted watermark information bit string by using a watermark information bit string parity check rule;
finally, splicing all watermark information bit strings extracted from the groups to obtain a final complete watermark information bit string.
9. A universal text watermarking apparatus that employs the method of any one of claims 1 to 8, comprising:
the character grouping module is responsible for grouping a certain number of characters in the selected word stock according to a specific strategy;
the character pattern design module is responsible for performing deformation design on all characters in each group processed by the character grouping module according to a specific rule and generating a watermark character data temporary file;
the watermark coding generation module is responsible for generating watermark coding data of the user terminal and is used for identifying the identity authentication information of the user terminal;
the font library generating and loading module is responsible for dynamically generating and loading the watermark font library file in real time according to the watermark encoding data generated by the watermark encoding generating module, and by combining the watermark character data temporary file generated by the font design module and the grouped characters obtained by the character grouping module;
the watermark embedding module is responsible for operating the text file in the electronic format, and embedding watermark information in real time by using the watermark word stock file generated and loaded by the word stock generating and loading module when the file is printed out or the content of the file is displayed on a screen;
and the watermark extraction module is in charge of acquiring the document picture data which is obtained by processing of the watermark embedding module and is hidden with the watermark information, extracting the watermark information and further tracing a document divulgence source.
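The grouping procedure of claims 1 and 3 can be sketched as below. It is a simplified reading, not the claimed method itself: the probability of a group is approximated by the sum of its character frequency counts, only one character is moved per pass (from the group with the highest probability to the group with the lowest), and the loop stops as soon as a move no longer reduces the variance. The function names and the frequency dictionary are assumptions.

```python
from statistics import pvariance


def snake_group(chars, m):
    """Distribute frequency-sorted characters forward, then backward, over m groups."""
    groups = [[] for _ in range(m)]
    for start in range(0, len(chars), m):
        batch = chars[start:start + m]
        forward = (start // m) % 2 == 0
        order = range(m) if forward else range(m - 1, -1, -1)
        for g, ch in zip(order, batch):
            groups[g].append(ch)
    return groups


def group_weight(group, freq):
    """Assumed group probability: sum of the frequency counts of its characters."""
    return sum(freq[ch] for ch in group)


def rebalance(groups, freq):
    """Greedily move low-frequency characters until the variance stops falling."""
    groups = [list(g) for g in groups]

    def variance():
        return pvariance([group_weight(g, freq) for g in groups])

    while True:
        weights = [group_weight(g, freq) for g in groups]
        hi = max(range(len(groups)), key=weights.__getitem__)
        lo = min(range(len(groups)), key=weights.__getitem__)
        if hi == lo or len(groups[hi]) <= 1:
            break
        ch = min(groups[hi], key=freq.__getitem__)
        before = variance()
        groups[hi].remove(ch)
        groups[lo].append(ch)
        if variance() >= before:      # the move did not help: undo it and stop
            groups[lo].remove(ch)
            groups[hi].append(ch)
            break
    return groups
```

The forward-then-backward distribution already pairs the most frequent character of one pass with the least frequent character of the next, so the rebalancing step usually has little left to correct.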
CN202210100368.XA 2022-01-27 2022-01-27 Universal text watermarking method and device Active CN114708133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210100368.XA CN114708133B (en) 2022-01-27 2022-01-27 Universal text watermarking method and device

Publications (2)

Publication Number Publication Date
CN114708133A CN114708133A (en) 2022-07-05
CN114708133B true CN114708133B (en) 2022-11-15

Family

ID=82166113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210100368.XA Active CN114708133B (en) 2022-01-27 2022-01-27 Universal text watermarking method and device

Country Status (1)

Country Link
CN (1) CN114708133B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant