CN114708133A - Universal text watermarking method and device - Google Patents

Info

Publication number
CN114708133A
Authority
CN
China
Prior art keywords
watermark
information
character
file
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210100368.XA
Other languages
Chinese (zh)
Other versions
CN114708133B (en)
Inventor
李公宝
丛升日
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Guoyin Technology Co ltd
Original Assignee
Beijing Guoyin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Guoyin Technology Co ltd
Priority to CN202210100368.XA
Publication of CN114708133A
Application granted
Publication of CN114708133B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/0021 Image watermarking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Editing Of Facsimile Originals (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a universal text watermarking method and a universal text watermarking device. The method comprises the following steps: grouping a certain number of characters in the selected word stock according to a specific strategy; performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file; generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal; dynamically generating and loading a watermark font file in real time according to the watermark encoding data and by combining the watermark character data temporary file and the grouped characters; running a text file in an electronic format, and embedding watermark information in document content data printed out of the file and displayed on a screen in real time by using a watermark font file; and acquiring document picture data hidden with watermark information, extracting the watermark information, and tracing a document leakage source. The text watermarking method has the advantages of strong universality, good compatibility, high stability and simple watermark information embedding process.

Description

Universal text watermarking method and device
Technical Field
The invention belongs to the technical field of document protection and image processing, relates to a method and a device for embedding and extracting a digital watermark, and particularly relates to a universal text watermarking method and a universal text watermarking device.
Background
In existing text watermarking technology, text digital watermarking based on modifying the topological structure of characters has become the mainstream approach, because it improves the robustness of the watermarking algorithm against attacks such as print-and-scan, screen capture, and screen photography. Specific characters are deformed in several different ways, each deformation corresponds to a different watermark information bit string, and the deformed character data are stored in a dedicated watermark font library; watermark information is then embedded by font replacement while an electronic text document is printed or displayed on screen. Real-time loading of watermark information is therefore a key step of text watermarking based on font replacement. Currently, loading of the font library and embedding of the watermark information are mainly realized in the following ways:
1) By means of HOOK technology. To replace font information in real time, the content data of the text file must be acquired in advance. Typically, a system hook intercepts a specific file operation, the intermediate-format file data are obtained, the watermark information is embedded by a font replacement operation, and finally the normal file operation is executed. For example, by hooking the print operation of an electronic text file, watermark information can be embedded in real time while the document is printed normally; alternatively, after the file-open operation is intercepted, the content data in the specific file format are parsed, the font type is replaced to embed the watermark information, and finally the electronic file with the embedded watermark is opened.
2) By means of a software plug-in mechanism. To replace the font library in content of a specific file format and embed watermark information, a corresponding software plug-in module can be developed to parse the file content. For example, watermark information can be embedded in common Office formats such as Doc, Docx, Excel, or PPT based on the VBA (Visual Basic for Applications) macro or VSTO (Visual Studio Tools for Office) technology of Microsoft Office. When an electronic file in such a format is opened or printed, the Office plug-in calls interfaces provided by the software to parse and modify the file content, so that watermark information is embedded through font library replacement.
However, the above methods have many problems and disadvantages: 1) Poor universality. For example, HOOK technology handles watermark embedding during file printing well, but the on-screen display of an electronic file is difficult to intercept, so watermark information cannot be embedded in real time in displayed content; embedding based on a software plug-in mechanism is tied to a specific software system, not all software systems provide secondary development interfaces, and the resulting limitations are considerable. 2) The existing methods are complex to implement, and plug-in development for web browsers is especially difficult. 3) HOOK technology has poor stability, frequent compatibility problems with system software, and lower security. 4) Current watermark embedding algorithms embed the watermark with sequential redundancy in each page of document data, which requires the watermark information content to be locally intact. Such algorithms therefore resist attacks such as cropping, crumpling, staining, and tearing poorly. In view of the above problems, the present invention provides a general text watermarking solution.
Disclosure of Invention
The invention provides a method and a device for embedding and extracting a universal text watermark generated based on a dynamic word stock, which are used for solving the problems of poor universality of watermark loading, poor system stability, complex implementation process, low robustness performance of a watermark algorithm and the like in the prior art on the premise of not changing any use habit of a user.
The invention is characterized in that firstly, a certain number of characters in a selected character library are uniformly grouped according to a specific strategy, and all the characters in each group represent the same watermark information bit string; performing deformation design on all characters in each group according to a specific rule, respectively obtaining a plurality of watermark character contour curve data corresponding to each character, and generating a watermark character data temporary file; generating watermark coding data of the user terminal according to a specific rule so as to identify the identity authentication information of the user terminal; according to the watermark coding data, dynamically generating a watermark font file through a watermark character data temporary file, wherein the watermark font file has the same attribute with a same-name font file installed in a system; loading the watermark font file in real time and replacing the same-name font file installed in the system; running a text file in an electronic format, and embedding watermark information in document content data of file printout and screen display in real time; and acquiring document picture data with hidden watermark information, extracting the watermark information, and tracing a document divulgence source. Therefore, a universal text watermark embedding and extracting method and device are obtained.
The invention discloses a universal text watermark method, which comprises a text watermark embedding and extracting method, wherein the text watermark embedding comprises the following steps:
step one, grouping a certain number of characters in a selected word stock according to a specific strategy;
step two, performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file;
step three, generating watermark encoding data of the user terminal to identify the identity authentication information of the user terminal;
step four, dynamically generating and loading a watermark font file in real time according to watermark encoding data and combining a watermark character data temporary file and grouped characters;
and step five, operating the text file in the electronic format, and embedding watermark information in the document content data printed out of the file and displayed on the screen in real time by using the watermark font file.
Further, the method also comprises a text watermark extraction step, namely the step six: and acquiring document picture data hidden with watermark information, extracting the watermark information, and tracing a document leakage source.
Preferably, the method for grouping a certain number of characters in the selected word stock comprises:
First, common characters are sorted from high to low by word-frequency statistics, and the top N characters are taken to form the watermark character set (formula image BDA0003492152240000031).
Next, the N characters are initially divided into M groups, denoted $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$, where $M < N$, such that for any (formula image BDA0003492152240000032), $0 < i, j \le M$, $i \ne j$, (formula image BDA0003492152240000033).
The specific grouping process is as follows:
Step 1. Select the first M characters in word-frequency order (formula image BDA0003492152240000034) and assign them in sequence to $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$, adding one character to each group;
Step 2. Select the next M characters (formula image BDA0003492152240000035) and assign them in reverse order to $\{\Omega_M, \Omega_{M-1}, \ldots, \Omega_1\}$, adding one character to each group in turn;
Step 3. Repeat Step 1 and Step 2 until all N characters are grouped.
Then, a certain number of text training corpora are randomly selected, with the number of characters in each text training corpus fixed at t;
Finally, the probability that the characters of each group in $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$ appear in the text training corpora is computed, and the grouping result is optimized according to the probability distribution to obtain the final grouping $\{\Omega'_1, \Omega'_2, \ldots, \Omega'_M\}$.
Preferably, the method of optimizing the grouping result according to the probability distribution to obtain the final grouping includes:
Step 1. Compute, for each group $\Omega_i$ ($1 \le i \le M$), the probability that its characters appear in the text training corpora, and sort all probabilities from large to small;
Step 2. Extract the character with the lowest word frequency from the group with the highest probability and move it into the group with the lowest probability; extract the character with the lowest word frequency from the group with the second-highest probability and move it into the group with the second-lowest probability; continue in this order until all moves are completed;
Step 3. Repeat Step 1 and Step 2 until the variance of the group probabilities reaches its minimum, thereby obtaining the final grouping $\{\Omega'_1, \Omega'_2, \ldots, \Omega'_M\}$.
Preferably, the performing of the deformation design on all the characters in each group means that the vector contour curve data of the characters is adjusted to obtain d different deformations, wherein d is larger than or equal to 2, and the different contour curve deformations represent different watermark information bit strings;
Preferably, all characters in each group represent the same watermark information bit string.
Preferably, the generating of the temporary file of watermark character data refers to storing the designed and generated character deformation contour curve data in the temporary file together with character attribute description information, where the character attribute description information includes a width of a font frame, a height of the font frame, and an offset of each font in the font frame, where the offset may vary with different font structures.
Preferably, the generating of the user terminal watermark encoding data includes identity authentication information and time information of the user terminal, and the specific generating method includes a manual designation and an automatic allocation mode.
Preferably, the automatic allocation method includes:
Step 1. Create a user terminal identity information record table in the system background. Each record contains a user ID, a user login account, a machine MAC address, and a machine IP address; the user ID is automatically allocated by the background system, and the remaining information is automatically submitted by a client monitoring program.
Step 2. Run the client monitoring program, which automatically acquires the identity information of the user terminal and uploads it to the system background. If the uploaded identity information already exists in the record table of the background database, the corresponding user ID is returned directly; otherwise a new record is added to the background database, the user ID is incremented by 1, and the new user ID is returned to the client.
Step 3. After receiving the user ID returned by the system background, the client monitoring process reads the system time in real time and applies error correction coding to the user ID information and the time information to obtain the final user terminal watermark encoding data.
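The allocation steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the in-memory `TerminalRegistry` stands in for the background database, and the 16-bit user ID plus 32-bit timestamp split in `pack_payload` is an assumption chosen only so that the payload is 48 bits long, a multiple of 8.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (login account, MAC address, IP address) as submitted by the client monitor
TerminalKey = Tuple[str, str, str]


@dataclass
class TerminalRegistry:
    """Hypothetical in-memory stand-in for the background record table."""
    next_id: int = 1
    records: Dict[TerminalKey, int] = field(default_factory=dict)

    def resolve(self, account: str, mac: str, ip: str) -> int:
        """Return the existing user ID, or allocate the next one for a new terminal."""
        key = (account, mac, ip)
        if key not in self.records:
            self.records[key] = self.next_id
            self.next_id += 1
        return self.records[key]


def pack_payload(user_id: int, unix_time: int,
                 id_bits: int = 16, time_bits: int = 32) -> List[int]:
    """Concatenate user ID and time into a bit list, most significant bit first.

    The field widths are assumptions; the source only states that the
    payload combines user ID information and time information.
    """
    if not 0 <= user_id < (1 << id_bits):
        raise ValueError("user_id does not fit in id_bits")
    value = (user_id << time_bits) | (unix_time & ((1 << time_bits) - 1))
    total = id_bits + time_bits
    return [(value >> (total - 1 - i)) & 1 for i in range(total)]
```

The resulting bit list is the input to the error correction coding step; the parity row is added separately.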
Preferably, the error correction coding processing refers to that, for the watermark information bit string with the length Len, Len is a multiple of 8, the watermark information error correction coding processing is performed in a parity check manner to obtain complete watermark coding data, and the specific process is as follows: arranging watermark information bit strings into a matrix of Len/8 rows and 8 columns; the parity check code for each column of information is calculated to form the Len/8+1 th row, and the total length of the valid information code and the valid parity check code is Len + 8.
Preferably, the dynamically generating and loading the watermark font file in real time means opening a temporary watermark character data file, dynamically reading font outline curve data from the temporary watermark character data file according to a grouping strategy and user terminal watermark encoding data, and updating corresponding font structure data in a standard font installed in the system.
Preferably, dynamically generating and loading the watermark font library file in real time means that loading is carried out differently on the Windows and Linux operating systems:
1) Windows environment.
First, the system API function RemoveFontResource(LPCTSTR lpFileName) is called to remove the standard font library installed in the system from the system font table, and a WM_FONTCHANGE message is sent to all top-level windows in the system to notify them of the change; then the AddFontResource(LPCTSTR lpszFilename) function is called to add the dynamically generated watermark font file resource to the system font table for normal use by Windows applications.
2) Linux environment.
The standard font installed in the system is uninstalled by deleting its font file from the corresponding folder. After the global font is deleted, the command fc-cache -fv is issued to update the system font cache. The dynamically generated watermark font library file is then copied into the corresponding font directory, and fc-cache -fv is issued again so that the system registers the change. Once fc-cache completes, all users on the system can access the newly added global font.
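A rough sketch of both loading paths is given below. The function names, the injectable `run` parameter, and the choice to overwrite the installed font at the same path are illustrative assumptions. The Windows branch calls the wide-character variants `RemoveFontResourceW` and `AddFontResourceW` through `ctypes` and broadcasts `WM_FONTCHANGE`; it runs only on Windows and is not exercised here.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, List

WM_FONTCHANGE = 0x001D   # Windows message sent when the font table changes
HWND_BROADCAST = 0xFFFF  # pseudo-handle addressing all top-level windows


def fc_cache_command() -> List[str]:
    """Command line that force-rebuilds the fontconfig cache verbosely."""
    return ["fc-cache", "-f", "-v"]


def swap_font_linux(watermark_font: Path, installed_font: Path,
                    run: Callable[..., object] = subprocess.run) -> None:
    """Delete the installed font, refresh the cache, copy the new file, refresh again."""
    if installed_font.exists():
        installed_font.unlink()
    run(fc_cache_command(), check=True)
    shutil.copyfile(watermark_font, installed_font)
    run(fc_cache_command(), check=True)


def swap_font_windows(installed_font: str, watermark_font: str) -> None:
    """Windows only: swap entries in the system font table via GDI."""
    import ctypes  # ctypes.windll is available only on Windows
    gdi32, user32 = ctypes.windll.gdi32, ctypes.windll.user32
    gdi32.RemoveFontResourceW(installed_font)
    user32.SendMessageW(HWND_BROADCAST, WM_FONTCHANGE, 0, 0)
    if gdi32.AddFontResourceW(watermark_font) == 0:
        raise OSError("AddFontResourceW failed")
    user32.SendMessageW(HWND_BROADCAST, WM_FONTCHANGE, 0, 0)
```

Passing a custom `run` callable makes the Linux path easy to dry-run without touching the real font cache.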
Preferably, the extracting of the watermark information mainly includes three steps:
1) according to the distribution of character groups in the document content, respectively extracting watermark information bit strings represented by all characters in each group;
2) checking and correcting the preliminarily extracted watermark information bit string by using a watermark information bit string parity check rule;
3) finally, splicing all watermark information bit strings extracted from the groups to obtain a final complete watermark information bit string.
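The three extraction steps can be sketched as below, under several assumptions that the source does not state: each recognized character yields a (group index, variant index) observation, disagreements within a group are settled by majority vote, each group carries 2 bits with the high bit first, the parity row uses even parity, and a single missing bit per column is restored from that column's parity. How the actual method corrects errors is not specified in the text.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def vote_per_group(observations: Sequence[Tuple[int, int]],
                   num_groups: int) -> List[Optional[int]]:
    """Majority-vote the variant index of each group; None if the group never appears."""
    tallies = [Counter() for _ in range(num_groups)]
    for group, variant in observations:
        tallies[group][variant] += 1
    return [t.most_common(1)[0][0] if t else None for t in tallies]


def splice_and_check(symbols: Sequence[Optional[int]], bits_per_group: int = 2,
                     width: int = 8) -> Tuple[List[Optional[int]], bool]:
    """Splice group symbols into a bit matrix, repair single erasures, verify parity."""
    bits: List[Optional[int]] = []
    for s in symbols:
        for shift in range(bits_per_group - 1, -1, -1):
            bits.append(None if s is None else (s >> shift) & 1)
    rows = [bits[i:i + width] for i in range(0, len(bits), width)]
    for c in range(width):
        col = [r[c] for r in rows]
        missing = [i for i, b in enumerate(col) if b is None]
        if len(missing) == 1:
            # Even parity: the missing bit equals the parity of the known bits.
            rows[missing[0]][c] = sum(b for b in col if b is not None) % 2
    complete = all(None not in r for r in rows)
    parity_ok = complete and all(sum(col) % 2 == 0 for col in zip(*rows))
    payload = [b for r in rows[:-1] for b in r]  # drop the parity row
    return payload, parity_ok
```

With this layout a group that is entirely absent from the captured text erases two bits in the same row, and both can be restored as long as no other bit is missing in their columns.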
Based on the same inventive concept, the invention also provides a universal text watermarking device, which comprises:
a character grouping module: the system is responsible for grouping a certain number of characters in the selected word stock according to a specific strategy;
a font design module: the system is responsible for performing deformation design on all characters in each group processed by the character grouping module according to a specific rule and generating a watermark character data temporary file;
watermark coding generation module: the system is responsible for generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal;
a font library generation and loading module: responsible for dynamically generating and loading the watermark font file in real time according to the watermark encoding data generated by the watermark coding generation module, combined with the watermark character data temporary file generated by the font design module and the grouped characters obtained by the character grouping module;
a watermark embedding module: responsible for running a text file in an electronic format and, using the watermark font library file generated and loaded by the font library generation and loading module, embedding watermark information in real time in the document content data of file printout and screen display;
a watermark extraction module: and the system is responsible for acquiring the document picture data which is obtained after the processing of the watermark embedding module and is hidden with the watermark information, extracting the watermark information and further tracing the document leakage source.
The invention has the following beneficial effects:
According to the invention, a personalized watermark font library is dynamically generated at the client from the unique watermark encoding data of the user terminal, and unique ID information is embedded in each watermark font library. When the user identity information of different clients differs, the watermark information embedded in their watermark font libraries differs as well. After the watermark font library is loaded in real time, every application that calls the local font library for printing or screen display embeds watermark information in the file content in real time. The text watermarking method therefore has strong universality, good compatibility, high stability, and a simple watermark embedding process. In addition, because watermark bits are embedded per character group without regard to order, the method is more robust against attacks such as cropping, crumpling, staining, and tearing.
Drawings
FIG. 1 is a schematic flow chart of an implementation of the universal text watermark embedding and extracting method described in the embodiment;
FIG. 2 is a schematic diagram of the arch grouping method;
FIG. 3 is a diagram of the structure of the watermark character data temporary file;
FIG. 4 is a schematic diagram of error correction encoding of watermark information in the method according to the embodiment;
FIG. 5 is a diagram illustrating the process of dynamically generating a watermark font library;
FIG. 6 is a schematic structural diagram of the device for embedding and extracting a universal text watermark in the embodiment.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.
Fig. 1 is a schematic flow chart of an implementation of a general text watermark embedding and extracting method in the embodiment.
S101, grouping a certain number of characters in the selected word stock according to a specific strategy.
In the invention, in order to realize universal watermark information embedding, a unique personalized watermark word stock is dynamically generated at each client, namely, the watermark word stocks generated by different clients are different (which is different from any existing technical scheme, namely, the watermark word stocks generated and installed by each client in the existing technical scheme are the same), and corresponding user terminal identity authentication information is embedded in each watermark word stock. After the watermark word stock is generated, the watermark word stock is dynamically loaded in real time and replaces a standard word stock which is pre-installed in an operating system, and when all application software systems calling watermark fonts of a client side perform screen display and printout operations of an electronic document, watermark information is automatically embedded into the document content in real time. Therefore, the watermarking method has strong universality, simple implementation process and good compatibility with a system and other application software. But in order to ensure sufficient information capacity and watermark extraction efficiency, we represent a specific watermark bit string by a group of characters, i.e. all characters in the group represent the same watermark information bit string. When at least one character of each packet is present in the text content, the corresponding bit string of watermark information can be correctly extracted. Therefore, a certain number of characters in the selected word stock are reasonably grouped according to a specific strategy, so that the probability that the characters in each group appear in the text content is relatively high.
The specific grouping method comprises the following steps:
First, common characters are sorted from high to low by word-frequency statistics, and the top N characters are taken to form the watermark character set (formula image BDA0003492152240000061).
In this embodiment, N is 2000.
Next, the N characters are preliminarily divided into M groups $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$, where $M < N$, such that for any (formula images BDA0003492152240000062 and BDA0003492152240000063), $0 < i, j \le M$, $i \ne j$, (formula image BDA0003492152240000064).
In the present embodiment, M is 28.
The specific grouping process is as follows:
Step 1. Select the first M characters in word-frequency order (formula image BDA0003492152240000065) and assign them in sequence to $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$, adding one character to each group.
Step 2. Select the next M characters (formula image BDA0003492152240000066) and assign them in reverse order to $\{\Omega_M, \Omega_{M-1}, \ldots, \Omega_1\}$, adding one character to each group in turn.
Step 3. Repeat Step 1 and Step 2 until all N characters are grouped.
The M groups $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$ are preliminarily obtained by the arch grouping method shown in FIG. 2. For example, the character set in the first group is: (formula image BDA0003492152240000071)
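The back-and-forth assignment of Steps 1 to 3 can be sketched as follows. The function name `serpentine_group` and the use of zero-based group indices are illustrative choices; the input is assumed to be already sorted by descending word frequency.

```python
from typing import List, Sequence


def serpentine_group(chars: Sequence[str], m: int) -> List[List[str]]:
    """Distribute frequency-sorted characters over m groups in back-and-forth order."""
    if m <= 0:
        raise ValueError("m must be positive")
    groups: List[List[str]] = [[] for _ in range(m)]
    for start in range(0, len(chars), m):
        block = chars[start:start + m]
        forward = (start // m) % 2 == 0
        # Even passes fill groups 0..m-1 (Step 1), odd passes fill m-1..0 (Step 2).
        targets = range(m) if forward else range(m - 1, -1, -1)
        for ch, g in zip(block, targets):
            groups[g].append(ch)
    return groups


demo = serpentine_group(list("abcdefghij"), 3)
# demo == [['a', 'f', 'g'], ['b', 'e', 'h'], ['c', 'd', 'i', 'j']]
```

Alternating the direction on every pass keeps the summed word frequency of the groups closer together than always filling from the first group, which is why a later balancing step only needs to make small corrections.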
Then, a certain number of text training corpora are randomly selected, with the number of characters in each fixed at t; in this embodiment t = 200. To verify the probability that each group of characters appears in common text documents, a large number of samples must be collected for training and testing. Nearly 50 million articles were therefore downloaded by internet crawlers, covering fields such as politics, military affairs, news, sports, culture, history, and finance. After content filtering and clipping, each collected article is stored as a text training corpus of 200 characters.
Finally, the probability that the characters of each group in $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$ appear in the text training corpora is computed, and the grouping result is optimized according to the probability distribution to obtain the final grouping $\{\Omega'_1, \Omega'_2, \ldots, \Omega'_M\}$.
Because the grouping operation is performed only based on the word frequency sorting result in the method, the situation of uneven probability distribution may occur in the actual text corpus training process, and therefore, a specific optimization operation is required to obtain more balanced grouping. The specific grouping optimization method comprises the following steps:
Step 1. Compute, for each group $\Omega_i$ ($1 \le i \le M$), the probability that its characters appear in the training corpora, and sort all probabilities from large to small;
Step 2. Extract the character with the lowest word frequency from the group with the highest probability and move it into the group with the lowest probability; extract the character with the lowest word frequency from the group with the second-highest probability and move it into the group with the second-lowest probability; continue in this order until all moves are completed;
Step 3. Repeat Step 1 and Step 2 until the variance of the group probabilities reaches its minimum, thereby obtaining the optimal grouping $\{\Omega'_1, \Omega'_2, \ldots, \Omega'_M\}$.
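One way to read this optimization loop is sketched below. Two points are assumptions rather than statements from the text: a group's probability is taken as the fraction of corpus samples that contain at least one of its characters, and "until the variance reaches the minimum" is implemented as "stop at the first round that does not lower the variance". The names `group_probabilities` and `rebalance` are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List, Sequence


def group_probabilities(groups: Sequence[Sequence[str]],
                        samples: Sequence[str]) -> List[float]:
    """Fraction of samples containing at least one character of each group."""
    sample_sets = [set(s) for s in samples]
    return [sum(1 for s in sample_sets if not s.isdisjoint(g)) / len(sample_sets)
            for g in groups]


def rebalance(groups: Sequence[Sequence[str]], freq: Dict[str, int],
              samples: Sequence[str], max_rounds: int = 100) -> List[List[str]]:
    """Move rare characters from high-probability groups to low-probability ones."""
    best = [list(g) for g in groups]
    best_var = pvariance(group_probabilities(best, samples))
    current = [list(g) for g in best]
    for _ in range(max_rounds):
        probs = group_probabilities(current, samples)
        order = sorted(range(len(current)), key=lambda i: probs[i], reverse=True)
        # Pair the highest with the lowest, the second highest with the second lowest, ...
        for k in range(len(order) // 2):
            rich, poor = current[order[k]], current[order[-1 - k]]
            if len(rich) > 1:
                rarest = min(rich, key=lambda c: freq.get(c, 0))
                rich.remove(rarest)
                poor.append(rarest)
        var = pvariance(group_probabilities(current, samples))
        if var >= best_var:
            break  # no further improvement; keep the best grouping seen so far
        best, best_var = [list(g) for g in current], var
    return best
```

Because only improving rounds are kept, the returned grouping never has a higher probability variance than the input grouping.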
S102, performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file.
Deformation design of a character means adjusting the character's vector contour curve data to obtain d different deformations, where d ≥ 2, with different contour deformations representing different watermark information bit strings. To increase the watermark information capacity, the number of deformations usually exceeds 2; in this embodiment d = 4, so the 4 deformations of each character represent a 2-bit watermark information bit string. Note that all characters in a group represent the same watermark information bit string.
And storing the character deformation contour curve data generated by the design in a temporary file together with character attribute description information, wherein the character attribute description information comprises the width of a font frame, the height of the font frame and the offset of each font in the font frame, and the offset can be changed along with the difference of font structures. In the temporary file, the storage structure of each font datum is as shown in fig. 3. The specific information is described as follows:
UNICODE encoding of characters: assigning a unique UNICODE code to the character represented by the glyph structure in the temporary file;
horizontal layout: the method comprises the following steps of (1) including the width of a font outer frame and the distance from the leftmost point of a font contour line to a font left frame;
vertical layout: the height of the font outer frame and the distance from the topmost point of the font contour line to the font upper frame are included;
size of primitive data: the capacity of the vector outline curve data of the character pattern structure is represented, and the unit is byte;
primitive data: an array of BYTE types stores vector outline curve data of a specific glyph structure, and also includes the definition of a grid and associated instruction data.
S103, generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal.
To track and trace a leak originating at a client effectively, unique watermark coding data must be generated for each user terminal. Its content includes the identity authentication information of the user terminal and time information. It can be generated in either a manual designation mode or an automatic allocation mode. In the manual designation mode, unique coding information is assigned to the client by hand when the client system is installed and deployed. The automatic allocation mode comprises the following steps:
Step1, creating a user terminal identity information record table in the system background, whose fields include the user ID, the user login account, the machine MAC address and the machine IP address; the user ID is allocated and added automatically by the background system, and the remaining fields are submitted automatically by the client monitoring program;
Step2, running the client monitoring program, which automatically collects the identity information of the user terminal and uploads it to the system background; if the uploaded identity information already exists in the record table of the background database, the corresponding user ID is returned directly; otherwise a new record is added to the background database, the user ID is incremented by 1, and the new user ID is returned to the client;
Step3, after receiving the user ID returned by the system background, the client monitoring process reads the system running time in real time and applies error correction coding to the user ID and the time information to obtain the final user terminal watermark coding data.
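The look-up-or-allocate behaviour of Step2 can be illustrated with a minimal Python sketch. The class name `TerminalRegistry`, the in-memory dictionary standing in for the background database, and the lower-casing of the MAC address are assumptions made for illustration only; the patent does not specify a storage layer or any normalization.

```python
from dataclasses import dataclass, field


@dataclass
class TerminalRegistry:
    """In-memory stand-in for the background record table (hypothetical)."""

    records: dict = field(default_factory=dict)  # identity tuple -> user ID
    last_id: int = 0

    def get_or_assign(self, account: str, mac: str, ip: str) -> int:
        # Assumption: MAC addresses are compared case-insensitively.
        key = (account, mac.lower(), ip)
        if key in self.records:
            # Known terminal: return the existing user ID directly.
            return self.records[key]
        # Unknown terminal: add a record whose user ID is the last ID plus 1.
        self.last_id += 1
        self.records[key] = self.last_id
        return self.last_id
```

A terminal that uploads the same account, MAC address and IP address twice receives the same user ID both times, which is what lets the ID serve as a stable identity inside the watermark.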
For a watermark information bit string of length Len, where Len is a multiple of 8, error correction coding by parity check is applied to obtain the complete watermark coding data. The specific process is as follows: the watermark information bit string is arranged into a matrix of Len/8 rows and 8 columns; the parity check bit of each column is calculated, and these bits form row Len/8+1, so that the total length of the information code plus the parity check code is Len+8. In this embodiment Len = 48, and the watermark information bit string is arranged as shown in Fig. 4: the first 6 rows are the valid watermark information bits, and each bit of the last row (the gray part in the figure) is the parity of the first 6 rows of its column, so the final watermark coding data is 56 bits long. Since 28 groups are selected in this embodiment and each group represents 2 bits of watermark information, exactly 56 bits of watermark coding data can be embedded.
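The column parity scheme can be sketched in Python as follows. The text does not state whether even or odd parity is used, so even parity is assumed here; the function names and the representation of bits as a string of "0" and "1" characters are likewise illustrative choices.

```python
def column_parity(bits: str) -> str:
    """Return the 8 parity bits of a bit string laid out in rows of 8 columns."""
    if not bits or len(bits) % 8 != 0 or set(bits) - {"0", "1"}:
        raise ValueError("expected a non-empty bit string whose length is a multiple of 8")
    rows = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    # Assumption: even parity, i.e. each parity bit makes its column sum even.
    return "".join(str(sum(int(row[c]) for row in rows) % 2) for c in range(8))


def encode(bits: str) -> str:
    """Append the parity row, giving Len + 8 bits in total."""
    return bits + column_parity(bits)


def failed_columns(encoded: str) -> list:
    """List the column indices whose stored parity bit does not match."""
    info, parity = encoded[:-8], encoded[-8:]
    expected = column_parity(info)
    return [c for c in range(8) if expected[c] != parity[c]]


def to_symbols(encoded: str, bits_per_group: int = 2) -> list:
    """Split the encoded data into one symbol per character group."""
    return [encoded[i:i + bits_per_group] for i in range(0, len(encoded), bits_per_group)]
```

With Len = 48, `encode` returns 56 bits and `to_symbols` yields 28 two-bit symbols, matching the 28 groups of this embodiment. Note that a single parity row identifies the column in which a bit error occurred but not the row.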
S104, dynamically generating and loading the watermark font file in real time according to the watermark coding data.
Dynamically generating the watermark font library means opening the watermark character data temporary file, reading glyph outline curve data from it according to the grouping strategy and the watermark coding information, and updating the corresponding glyph structure data in the standard font library installed on the system. The specific process is shown in Fig. 5.
Step1, parsing the key data of the standard font library file installed on the system, including the font information header, the maximum requirement table, the naming table, the font coding mapping table, the primitive position index table, the horizontal and vertical layout metrics tables, the primitive data, and so on;
Step2, reading the temporary file containing the watermark character outline curve data;
Step3, initializing an empty font coding mapping table M, primitive position index table S, horizontal layout metrics table H and vertical layout metrics table V, and creating an empty primitive data file tmp for storing the vector primitive data;
Step4, writing the data of each character in turn. Let $dwS_{i-1}$ be the total amount of primitive data of the $i-1$ characters processed before the current $i$-th character; the $(i-1)$-th entry of the primitive position index table is then $S_{i-1} = dwS_{i-1}$. The $i$-th character is processed as follows:
1) updating the font coding mapping table M according to the UNICODE code that the character represents in the font library;
2) reading from the temporary file the horizontal and vertical layout information of the character, the data amount $dw_N$ of the new primitive and the vector outline data of the new primitive, and updating them into the corresponding tables of the target watermark font library;
3) updating the primitive position index table of the target watermark font file as $S_i = S_{i-1} + dw_N$;
4) saving the primitive data read from the temporary file into the primitive data file tmp.
Step5, writing the font file header information and related attribute values, the font coding mapping table, the primitive position index table and the horizontal and vertical layout metrics tables in sequence according to the structure of the font file, and finally writing all the primitive data stored in tmp into the primitive data area of the newly generated font library, thereby producing the new watermark font library file.
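The running-offset bookkeeping of Step4, where each index entry equals the previous entry plus the size of the primitive just written, can be sketched in Python. The function name, the use of a `bytearray` in place of the file tmp, and the starting value of 0 for the first entry are assumptions for illustration; the sketch does not parse or write a real font file.

```python
def build_index_table(primitives: list) -> tuple:
    """Concatenate primitive blobs and record where each one starts and ends.

    primitives: one bytes object of outline data per character, in write order.
    Returns (index, data), where index[i] is the total size of the first i
    primitives, so index[i] = index[i - 1] + len(primitives[i - 1]).
    """
    index = [0]          # assumption: the first primitive starts at offset 0
    tmp = bytearray()    # stand-in for the primitive data file tmp
    for data in primitives:
        tmp += data
        index.append(index[-1] + len(data))
    return index, bytes(tmp)
```

The outline data of the character at position `k` (counting from 0) can then be recovered as `data[index[k]:index[k + 1]]`, which is why the index table and the primitive data area must be written consistently in Step5.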
As described above, after the watermark font library has been generated according to the watermark coding information, the font resource table of the operating system must be updated so that application software correctly invokes the watermark font library. Depending on the operating system, the watermark font library is loaded in one of the following two ways:
1) Windows environment.
First, the system API function RemoveFontResource(LPCTSTR lpFileName) is called to remove the standard font library installed on the system from the system font table, and a WM_FONTCHANGE message is sent to all top-level windows in the system to notify them of the change. Then the AddFontResource(LPCTSTR lpszFilename) function is called to add the dynamically generated watermark font library file to the system font table for normal use by Windows applications.
2) Linux environment.
The standard fonts installed on the system are uninstalled by deleting the font files from the corresponding folders. After a global font has been deleted, the fc-cache -fv command is issued to update the system font cache. The dynamically generated watermark font library file is then copied into the corresponding font directory, and fc-cache -fv is issued again so that the system registers the change. Once fc-cache completes, all users of the system can access the newly added global font.
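The Linux sequence can be sketched in Python as below. The function name `swap_font_linux` and the injectable `run` parameter (which lets the sketch be exercised without touching the real font cache) are assumptions for illustration; the sketch deletes the standard font file without keeping a backup, which a real deployment would probably want to add.

```python
import shutil
import subprocess
from pathlib import Path


def swap_font_linux(standard_font: Path, watermark_font: Path, run=subprocess.run) -> Path:
    """Replace an installed font file with the watermark font and refresh the cache."""
    font_dir = standard_font.parent

    # Uninstall the standard font by deleting its file, then refresh the cache.
    standard_font.unlink()
    run(["fc-cache", "-f", "-v"], check=True)

    # Copy the watermark font into the same directory and refresh again.
    target = font_dir / watermark_font.name
    shutil.copy2(watermark_font, target)
    run(["fc-cache", "-f", "-v"], check=True)
    return target
```

Writing to a system-wide font directory normally requires elevated privileges, so a monitoring program that calls such a routine would have to run with the corresponding rights.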
S105, opening the text file in electronic format, and embedding watermark information in real time into the document content data that is printed out or displayed on screen.
As described above, in the present invention, the watermark font file is dynamically generated and updated to the font table of the operating system to replace the standard font file installed by the operating system, so that the application program of the system automatically calls the newly loaded watermark font file, thereby completing the real-time embedding of the watermark information.
S106, obtaining the document picture data with the hidden watermark information, extracting the watermark information, and tracing the document divulgence source.
The watermark information extraction process mainly comprises three steps:
1) according to the distribution of character groups in the document content, respectively extracting watermark information bit strings represented by all characters in each group;
2) checking and correcting the preliminarily extracted watermark information bit string by using a watermark information bit string parity check rule;
3) finally, splicing the bit strings extracted from all groups to obtain the final complete watermark information bit string.
It should be noted that a given character represents the same watermark information bit string wherever it appears in the document content, and that all characters of the document content belonging to the same group also represent the same watermark information. For processing efficiency, a threshold p is therefore defined for the number of watermark extractions per group, that is, watermark information is extracted at most p times in each group. When the same character occurs more than p times in a group, or the group contains more than p characters, the watermark extraction process is run only p times; otherwise, extraction is performed for every character that appears.
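The capped per-group extraction can be sketched in Python. The patent does not say how the up to p readings of a group are combined, so a simple majority vote is assumed here; the function names and the `(group_index, symbol)` input format are also illustrative, and the parity check of step 2) is left out.

```python
from collections import Counter


def extract_group_symbols(observations, num_groups: int, p: int) -> list:
    """Combine at most p readings per group into one symbol per group.

    observations: iterable of (group_index, symbol) pairs in reading order,
    where symbol is the bit string read from one recognized character.
    """
    votes = [Counter() for _ in range(num_groups)]
    used = [0] * num_groups
    for group, symbol in observations:
        if used[group] < p:          # readings beyond the threshold are skipped
            votes[group][symbol] += 1
            used[group] += 1
    # Assumption: the most frequent reading wins; None marks a group never seen.
    return [v.most_common(1)[0][0] if v else None for v in votes]


def splice(symbols: list) -> str:
    """Concatenate the per-group symbols into the complete bit string."""
    if any(s is None for s in symbols):
        raise ValueError("at least one group has no reading")
    return "".join(symbols)
```

A larger p gives the vote more readings to work with at the cost of more extraction runs, which is the efficiency trade-off the threshold is meant to control.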
As shown in fig. 6, based on the same inventive concept, the present invention further provides a general text watermark embedding and extracting apparatus, including:
character grouping module 1: responsible for grouping a certain number of characters in the selected font library according to a specific strategy;
font design module 2: responsible for performing deformation design on all characters in each group produced by the character grouping module according to a specific rule, and generating the watermark character data temporary file;
watermark coding generation module 3: responsible for generating the watermark coding data of the user terminal, which identifies the identity authentication information of the user terminal;
font library generation and loading module 4: responsible for dynamically generating and loading the watermark font file in real time according to the watermark coding data generated by the watermark coding generation module, in combination with the watermark character data temporary file generated by the font design module and the grouped characters obtained by the character grouping module;
watermark embedding module 5: responsible for opening the text file in electronic format and, when the file is printed out or its content is displayed on screen, embedding watermark information in real time by using the watermark font file generated and loaded by the font library generation and loading module;
watermark extraction module 6: responsible for acquiring the document picture data, produced through the watermark embedding module, in which watermark information is hidden, extracting the watermark information, and thereby tracing the source of the document leak.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
For example, to prevent the dynamically generated watermark font library from being maliciously tampered with, or from being damaged when a user reinstalls the related font, a watermark font library detection service is deployed and installed on the client. The detection service calculates the SHA1 value of the newly generated watermark font library file and records it in a system ledger. At regular intervals it scans the loaded watermark font library file, calculates the SHA1 value of the current file, and compares it with the value recorded in the system ledger. If the two differ, the watermark font file has been damaged, and the watermark font library generation and loading process is executed again.
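The integrity check performed by the detection service can be sketched with Python's standard `hashlib` module. The function names and the idea of passing the ledger value in as a hex string are assumptions for illustration; how the ledger is stored and how often the scan runs are not specified in the text.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def font_is_intact(path: Path, recorded_digest: str) -> bool:
    """Compare the current digest with the value recorded at generation time."""
    return sha1_of_file(path) == recorded_digest
```

When `font_is_intact` returns `False`, the service would trigger the generation and loading process again and record the digest of the regenerated file.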
For example, the generated watermark encoding data of the user terminal includes the identity authentication information and the time information of the user terminal. In order to accurately record different time information every day, a watermark font library generation time information monitoring program can be deployed and installed on the client side. When the operating system is restarted, the monitoring program executes the dynamic generation and real-time loading work of the watermark font library and records the effective time of the current watermark font library. And during the normal operation of the operating system, the monitoring program regularly detects the effective time of the watermark font library. And if the current time is not in the same day as the effective time, the monitoring program re-executes the dynamic generation and real-time loading work, and updates the effective time of the watermark font library again.
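The same-day test that drives regeneration can be written as a small Python sketch. The function name and the use of local `datetime` values are assumptions; the text does not say which clock or time zone the monitoring program uses.

```python
from datetime import datetime


def needs_regeneration(effective_time: datetime, now: datetime) -> bool:
    """Return True when the font library was generated on a different calendar day."""
    return effective_time.date() != now.date()
```

The monitoring program would call such a check periodically and, whenever it returns `True`, regenerate and reload the watermark font library and store the new effective time.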
For example, to improve the efficiency of dynamically generating the watermark font library on the user terminal, a watermark font library generation time monitoring program is deployed and installed on the client, and the watermark font libraries for different time periods are generated in advance. When the operating system restarts, the monitoring program selects the watermark font file corresponding to the current system time period, copies and loads it in real time, and then deletes the expired watermark font files that are no longer in use.
For example, in the character grouping optimization process, the grouping strategy can be further optimized by splitting words. Suppose the two characters that make up the word "aim" both appear in the first group, and "aim" is a high-frequency word, so that in normal text content the probability that both characters appear together is relatively high. The character of the pair with the lower character frequency can then be moved, in the current optimization stage, into a group with a lower probability, which yields a grouping result with more evenly balanced probabilities.
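The initial grouping that this optimization starts from distributes the frequency-ranked characters over the M groups in alternating forward and reverse order. A minimal Python sketch of that distribution follows; the function name is hypothetical, and the probability-based optimization and word splitting described above are not included.

```python
def serpentine_groups(chars_by_frequency, m: int) -> list:
    """Distribute frequency-ranked characters over m groups, reversing direction each round.

    chars_by_frequency: characters sorted from highest to lowest frequency.
    """
    chars = list(chars_by_frequency)
    if m <= 0 or m >= len(chars):
        raise ValueError("the number of groups M must satisfy 0 < M < N")
    groups = [[] for _ in range(m)]
    for pos, ch in enumerate(chars):
        round_no, slot = divmod(pos, m)
        # Even rounds fill groups 1..M, odd rounds fill groups M..1.
        idx = slot if round_no % 2 == 0 else m - 1 - slot
        groups[idx].append(ch)
    return groups
```

For example, `serpentine_groups("abcdefgh", 3)` places `a`, `f`, `g` in the first group, `b`, `e`, `h` in the second and `c`, `d` in the third, so each group receives a mix of high-frequency and low-frequency characters.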
For example, in the watermark font library generation method of embodiment 1, only one of the contour curve deformations of each character is written into the standard character coding area, so that system application software works normally without problems such as garbled display. Alternatively, the contour curve data of all deformations of each character can be copied into the extended coding area of the font library, with a unique UNICODE code assigned to each deformation; more watermark information can then be embedded in text with fewer characters by dynamically replacing character codes during document printout or outgoing-document management, which increases the watermark information capacity.

Claims (10)

1. A universal text watermarking method, comprising the steps of:
grouping a certain number of characters in the selected word stock according to a specific strategy;
performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file;
generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal;
dynamically generating and loading a watermark font file in real time according to the watermark encoding data and by combining the watermark character data temporary file and the grouped characters;
and running the text file in an electronic format, and embedding watermark information in the document content data printed out of the file and displayed on a screen in real time by using the watermark font file.
2. The method of claim 1, further comprising the steps of:
and acquiring document picture data hidden with watermark information, extracting the watermark information, and tracing a document leakage source.
3. The method of claim 1 or 2, wherein grouping a number of characters in the selected word stock comprises:
firstly, sequencing common characters from high to low according to word frequency statistical results, and acquiring top N characters to form a watermark character set
Figure FDA0003492152230000011
Next, the N characters are initially divided into M groups, denoted as $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$, where $M < N$, such that for any
Figure FDA0003492152230000012
The specific grouping process comprises the following steps:
step1. first M characters are selected according to the character frequency sequence
Figure FDA0003492152230000013
are sequentially assigned to the groups $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$, one character being added to each group;
step2. select M characters again
Figure FDA0003492152230000014
and adding one character to each group in turn in the reverse order $\{\Omega_M, \Omega_{M-1}, \ldots, \Omega_1\}$;
step3, repeatedly executing Step1 and Step2 until the grouping of the N characters is completed;
then, randomly selecting a certain number of text training corpora, and fixing the number of characters of each text training corpus at t;
finally, counting the probability with which the characters of every group $\{\Omega_1, \Omega_2, \ldots, \Omega_M\}$ appear in the text training corpora of t characters, and optimizing the grouping result according to the probability distribution to obtain the final grouping $\{\Omega'_1, \Omega'_2, \ldots, \Omega'_M\}$.
4. The method according to claim 1 or 2, wherein the optimizing the grouping result according to the probability distribution to obtain the final grouping comprises:
step1, calculating the probability that the characters in each group $\Omega_i$ ($1 \le i \le M$) appear in the text training corpora, and sorting all the probabilities from largest to smallest;
step2, moving the character with the lowest word frequency out of the group with the highest probability into the group with the lowest probability, then moving the character with the lowest word frequency out of the group with the second-highest probability into the group with the second-lowest probability, and so on until all such moves are complete;
step3, repeating step1 and step2 until the variance of the probabilities of all groups reaches its minimum, thereby obtaining the final grouping $\{\Omega'_1, \Omega'_2, \ldots, \Omega'_M\}$.
5. The method according to claim 1 or 2, wherein the performing deformation design on all the characters in each group means that vector contour curve data of the characters are adjusted to obtain d different deformations, wherein d is larger than or equal to 2, and the different contour curve deformations represent different watermark information bit strings; all characters in each packet represent the same watermark information bit string; the generation of the temporary file of the watermark character data refers to the fact that character deformation contour curve data generated by design and character attribute description information are stored in the temporary file together, wherein the character attribute description information comprises the width of a font frame, the height of the font frame and the offset of each font in the font frame, and the offset can be changed along with the difference of font structures.
6. The method according to claim 1 or 2, characterized in that the user terminal watermark encoding data comprises the identity authentication information and the time information of the user terminal, and the user terminal watermark encoding data is generated by adopting a manual designation or automatic distribution mode; the automatic allocation mode comprises the following steps:
step1, creating a user terminal identity identification information recording table in a system background, wherein the information content comprises user ID, a user login account, a machine MAC address and machine IP address information, the user ID is automatically allocated and added by the background system, and the rest information is automatically submitted for a client monitoring program;
step2, operating a client monitoring program, automatically acquiring identity identification information of the user terminal and uploading the identity identification information to a system background, and directly returning user ID information when the uploaded identification information exists in a record table of a database of the system background, or else, adding a new record in the database of the system background, adding 1 to the user ID and returning the user ID to the client;
and step3, after receiving the user ID information returned by the system background, the client monitoring process reads the system operation time in real time, and performs error correction coding processing on the user ID information and the time information to obtain the final user terminal watermark coding data.
7. The method according to claim 6, wherein the error correction coding process is performed on the watermark information bit string with the length Len, Len is a multiple of 8, and the error correction coding process is performed in a parity check manner to obtain the complete watermark coding data, and the specific process is as follows: arranging watermark information bit strings into a matrix of Len/8 rows and 8 columns; the parity check code for each column of information is calculated to form the Len/8+1 th row, and the total length of the valid information code and the valid parity check code is Len + 8.
8. The method according to claim 1 or 2, wherein the dynamically generating and loading the watermark font file in real time means opening a temporary watermark character data file, dynamically reading font outline curve data from the temporary watermark character data file according to a grouping strategy and user terminal watermark encoding data, and updating corresponding font structure data in a system-installed standard font; the real-time loading of the watermark word stock file refers to the fact that the watermark word stock is loaded respectively according to different Windows and Linux operating systems.
9. The method of claim 2, wherein the extracting watermark information comprises:
according to the distribution of character groups in the document content, respectively extracting watermark information bit strings represented by all characters in each group;
checking and correcting the preliminarily extracted watermark information bit string by using a watermark information bit string parity check rule;
finally, splicing all watermark information bit strings extracted from the groups to obtain a final complete watermark information bit string.
10. A universal text watermarking apparatus using the method of any one of claims 1 to 9, comprising:
the character grouping module is responsible for grouping a certain number of characters in the selected word stock according to a specific strategy;
the character pattern design module is responsible for performing deformation design on all characters in each group processed by the character grouping module according to a specific rule and generating a watermark character data temporary file;
the watermark coding generation module is responsible for generating watermark coding data of the user terminal and is used for identifying the identity authentication information of the user terminal;
the font library generating and loading module is responsible for dynamically generating and loading the watermark font library file in real time according to the watermark encoding data generated by the watermark encoding generating module, and by combining the watermark character data temporary file generated by the font design module and the grouped characters obtained by the character grouping module;
the watermark embedding module is responsible for operating the text file in the electronic format, and embedding watermark information in real time by using the watermark word stock file generated and loaded by the word stock generating and loading module when the file is printed out or the content of the file is displayed on a screen;
and the watermark extraction module is in charge of acquiring the document picture data which is obtained by processing of the watermark embedding module and is hidden with the watermark information, extracting the watermark information and further tracing a document divulgence source.
CN202210100368.XA 2022-01-27 2022-01-27 Universal text watermarking method and device Active CN114708133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210100368.XA CN114708133B (en) 2022-01-27 2022-01-27 Universal text watermarking method and device

Publications (2)

Publication Number Publication Date
CN114708133A true CN114708133A (en) 2022-07-05
CN114708133B CN114708133B (en) 2022-11-15

Family

ID=82166113

Country Status (1)

Country Link
CN (1) CN114708133B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012132953A (en) * 2010-12-18 2012-07-12 Kimito Horie Configuration method and device of hyperbola cryptography using virtual surrogate pair
CN103136718A (en) * 2013-03-13 2013-06-05 上海理工大学 Anti-geometric-transformation practical color image digital watermark technology
CN106570356A (en) * 2016-11-01 2017-04-19 南京理工大学 Unicode coding-based text watermark embedding method and extraction method
CN107423629A (en) * 2017-04-12 2017-12-01 李晓妮 A kind of anti-method and system divulged a secret with tracing of fileinfo output
US20180096203A1 (en) * 2004-04-12 2018-04-05 Google Inc. Adding value to a rendered document
CN108763879A (en) * 2018-05-30 2018-11-06 北京溯斐科技有限公司 A kind of automatic generation method and device of watermark character library
CN110196968A (en) * 2019-06-06 2019-09-03 北京林业大学 A kind of simplified form of Chinese Character coding mode automatic recognition system and method searched based on specific character string
CN110674477A (en) * 2019-09-24 2020-01-10 北京溯斐科技有限公司 Document source tracing method and device based on electronic file security identification

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HALDER RAJU,ET AL.: "Watermarking Techniques for Relational Databases: Survey, Classification and Comparison", 《JOURNAL OF UNIVERSAL COMPUTER SCIENCE》 *
YAN LI,ET AL.: "A Robust and ReversibleWatermarking Algorithm for a Relational Database Based on Continuous Columns in Histogram", 《MATHEMATICS》 *
孙杉 等: "中文水印字库的自动生成方法", 《中国图象图形学报》 *
孙杉: "基于自动生成字库的中文鲁棒文档水印方法", 《中国优秀硕士论文全文数据库 信息科技辑》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115455966A (en) * 2022-11-14 2022-12-09 合肥高维数据技术有限公司 Safe word stock construction method and safe code extraction method thereof
CN115455965A (en) * 2022-11-14 2022-12-09 合肥高维数据技术有限公司 Character grouping method based on word distance word chain, storage medium and electronic equipment
CN115455987A (en) * 2022-11-14 2022-12-09 合肥高维数据技术有限公司 Character grouping method based on word frequency and word frequency, storage medium and electronic equipment
CN115455966B (en) * 2022-11-14 2023-03-10 合肥高维数据技术有限公司 Safe word stock construction method and safe code extraction method thereof
CN115455965B (en) * 2022-11-14 2023-03-10 合肥高维数据技术有限公司 Character grouping method based on word distance word chain, storage medium and electronic equipment
CN115455987B (en) * 2022-11-14 2023-05-05 合肥高维数据技术有限公司 Character grouping method based on word frequency and word frequency, storage medium and electronic equipment
CN115630343A (en) * 2022-12-20 2023-01-20 北京国隐科技有限公司 Electronic document information processing method, device and equipment
CN117891787A (en) * 2024-03-15 2024-04-16 武汉磐电科技股份有限公司 Current transformer quantity value tracing data processing method, system and equipment
CN117891787B (en) * 2024-03-15 2024-05-28 武汉磐电科技股份有限公司 Current transformer quantity value tracing data processing method, system and equipment

Also Published As

Publication number Publication date
CN114708133B (en) 2022-11-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant