CN114708133A

CN114708133A - Universal text watermarking method and device

Info

Publication number: CN114708133A
Application number: CN202210100368.XA
Authority: CN
Inventors: 李公宝; 丛升日
Original assignee: Beijing Guoyin Technology Co ltd
Current assignee: Beijing Guoyin Technology Co ltd
Priority date: 2022-01-27
Filing date: 2022-01-27
Publication date: 2022-07-05
Anticipated expiration: 2042-01-27
Also published as: CN114708133B

Abstract

The invention relates to a universal text watermarking method and a universal text watermarking device. The method comprises the following steps: grouping a certain number of characters in the selected word stock according to a specific strategy; performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file; generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal; dynamically generating and loading a watermark font file in real time according to the watermark encoding data and by combining the watermark character data temporary file and the grouped characters; running a text file in an electronic format, and embedding watermark information in document content data printed out of the file and displayed on a screen in real time by using a watermark font file; and acquiring document picture data hidden with watermark information, extracting the watermark information, and tracing a document leakage source. The text watermarking method has the advantages of strong universality, good compatibility, high stability and simple watermark information embedding process.

Description

Universal text watermarking method and device

Technical Field

The invention belongs to the technical field of document protection and image processing, relates to a method and a device for embedding and extracting a digital watermark, and particularly relates to a universal text watermarking method and a universal text watermarking device.

Background

In existing text watermarking technology, text digital watermarking based on modification of the character topological structure has become mainstream, because it improves the robustness of the watermarking algorithm against attacks such as print-and-scan, screen capture and screen photography. Specific characters are deformed in different ways, each deformation corresponds to a different watermark information bit string, and the character deformation data is stored in a specific watermark font library; watermark information is then embedded by font replacement while an electronic text document is printed out or displayed on screen. Real-time loading of watermark information is therefore a key step of text watermarking based on font replacement. Currently, loading of the word stock and embedding of the watermark information are mainly realized in the following ways:

1) Via HOOK technology. In order to replace font information in real time, the content data of a text file must be acquired in advance. Generally, a specific file operation is intercepted by a system hook, the intermediate-format file data is obtained, the watermark information is embedded by a font replacement operation, and finally the normal file operation is executed. For example, watermark information can be embedded in real time during normal printout of a document by hooking the printing operation of the electronic text file; alternatively, after the file-open operation is intercepted, the content data in the specific file format is parsed, font replacement is executed to embed the watermark information, and finally the electronic file embedded with the watermark information is opened.

2) By means of a software plug-in mechanism. In order to replace the word stock in content of a specific file format and embed the watermark information, a corresponding software plug-in module can be developed to parse the file content. For example, based on VBA (Visual Basic for Applications) macros or VSTO (Visual Studio Tools for Office) technology of Microsoft Office software, watermark information can be embedded in common Office format files such as Doc, Docx, Excel or PPT. When an electronic file in such a format is opened or printed out, the Office plug-in calls an interface provided by the software system to parse and modify the content of the electronic file, so that watermark information is embedded after the word stock type is replaced.

However, the above methods have many problems and disadvantages: 1) Universality is poor. For example, watermark information embedding during a file printing operation can be completed well based on the HOOK technology, but the screen display operation of an electronic file is difficult to intercept, so watermark information cannot be embedded in real time in the screen display content; the embedding method based on the software plug-in mechanism is tied to a specific software system, not all software systems provide secondary development interfaces, and the related limitations are considerable. 2) The implementation process of the existing methods is complex, and plug-in development based on a web browser is particularly difficult. 3) The HOOK technology has poor stability, many compatibility problems with system software, and lower security. 4) Current watermark embedding algorithms embed the watermark in a sequential redundancy mode for each page of document data, which requires the watermark information content to have local integrity. Therefore, such watermark algorithms perform poorly against attacks such as clipping, crumpling, staining and tearing. In view of the above problems, the present invention provides a universal text watermarking solution.

Disclosure of Invention

The invention provides a method and a device for embedding and extracting a universal text watermark generated based on a dynamic word stock, which are used for solving the problems of poor universality of watermark loading, poor system stability, complex implementation process, low robustness performance of a watermark algorithm and the like in the prior art on the premise of not changing any use habit of a user.

The invention is characterized in that firstly, a certain number of characters in a selected character library are uniformly grouped according to a specific strategy, and all the characters in each group represent the same watermark information bit string; performing deformation design on all characters in each group according to a specific rule, respectively obtaining a plurality of watermark character contour curve data corresponding to each character, and generating a watermark character data temporary file; generating watermark coding data of the user terminal according to a specific rule so as to identify the identity authentication information of the user terminal; according to the watermark coding data, dynamically generating a watermark font file through a watermark character data temporary file, wherein the watermark font file has the same attribute with a same-name font file installed in a system; loading the watermark font file in real time and replacing the same-name font file installed in the system; running a text file in an electronic format, and embedding watermark information in document content data of file printout and screen display in real time; and acquiring document picture data with hidden watermark information, extracting the watermark information, and tracing a document divulgence source. Therefore, a universal text watermark embedding and extracting method and device are obtained.

The invention discloses a universal text watermark method, which comprises a text watermark embedding and extracting method, wherein the text watermark embedding comprises the following steps:

step one, grouping a certain number of characters in a selected word stock according to a specific strategy;

step two, performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file;

step three, generating watermark encoding data of the user terminal to identify the identity authentication information of the user terminal;

step four, dynamically generating and loading a watermark font file in real time according to watermark encoding data and combining a watermark character data temporary file and grouped characters;

and step five, operating the text file in the electronic format, and embedding watermark information in the document content data printed out of the file and displayed on the screen in real time by using the watermark font file.

Further, the method also comprises a text watermark extraction step, namely the step six: and acquiring document picture data hidden with watermark information, extracting the watermark information, and tracing a document leakage source.

Preferably, the method for grouping a certain number of characters in the selected word stock comprises:

firstly, sorting common characters from high to low according to word frequency statistical results, and taking the top N characters to form the watermark character set;

next, the N characters are preliminarily divided into M groups, denoted {Ω₁, Ω₂, …, Ω_M}, where M < N and, for any 0 < i, j ≤ M with i ≠ j, Ω_i ∩ Ω_j = ∅.

The specific grouping process is as follows:

Step 1. selecting the first M characters in word-frequency order and assigning them in sequence to {Ω₁, Ω₂, …, Ω_M}, adding one character to each group;

Step 2. selecting the next M characters and assigning them in reverse order to {Ω_M, Ω_M-1, …, Ω₁}, adding one character to each group in turn;

Step 3. repeating Step 1 and Step 2 until all N characters are grouped.

Then, randomly selecting a certain number of text training corpora, and fixing the number of characters of each text training corpus at t;

finally, the probability that the characters of each group in {Ω₁, Ω₂, …, Ω_M} appear in the text training corpora of t characters is counted, and the grouping result is optimized according to the probability distribution to obtain the final grouping {Ω′₁, Ω′₂, …, Ω′_M}.

Preferably, the method of optimizing the grouping result according to the probability distribution to obtain the final grouping includes:

Step 1, calculating the probability that the characters in each group Ω_i, 1 ≤ i ≤ M, appear in the text training corpora of t characters, and sorting all the probabilities from large to small;

Step 2, extracting the character with the minimum word frequency from the group with the highest probability and moving it into the group with the lowest probability, extracting the character with the minimum word frequency from the group with the next highest probability and moving it into the group with the next lowest probability, and repeating the process in sequence until all groups have been processed;

Step 3, repeating Step 1 and Step 2 until the probability variance of all the groups reaches the minimum, thereby obtaining the final grouping {Ω′₁, Ω′₂, …, Ω′_M}.

Preferably, the performing of the deformation design on all the characters in each group means that the vector contour curve data of the characters is adjusted to obtain d different deformations, wherein d is larger than or equal to 2, and the different contour curve deformations represent different watermark information bit strings;

preferably, all characters in each group represent the same watermark information bit string.

Preferably, the generating of the temporary file of watermark character data refers to storing the designed and generated character deformation contour curve data in the temporary file together with character attribute description information, where the character attribute description information includes a width of a font frame, a height of the font frame, and an offset of each font in the font frame, where the offset may vary with different font structures.

Preferably, the generating of the user terminal watermark encoding data includes identity authentication information and time information of the user terminal, and the specific generating method includes a manual designation and an automatic allocation mode.

Preferably, the automatic allocation method includes:

Step 1, creating a user terminal identity identification information record table in the system background, wherein the information content comprises a user ID, a user login account, a machine MAC address and a machine IP address; the user ID is automatically allocated and added by the background system, and the remaining information is automatically submitted by a client monitoring program.

Step 2, running the client monitoring program, which automatically acquires the identity identification information of the user terminal and uploads it to the system background; when the uploaded identification information already exists in the record table of the system background database, the user ID information is returned directly; otherwise, a new record is added to the system background database, the user ID is incremented by 1, and the user ID is returned to the client.

Step 3, after receiving the user ID information returned by the system background, the client monitoring process reads the system running time in real time, and performs error correction coding on the user ID information and the time information to obtain the final user terminal watermark encoding data.
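The automatic allocation mode above can be illustrated with a minimal sketch. It is not the patented implementation: the in-memory `TerminalRegistry` class stands in for the background database, and the 16-bit user ID and 32-bit time fields in `build_payload` are assumptions chosen only so that the payload is 48 bits long; the source does not specify field widths.

```python
from dataclasses import dataclass, field


@dataclass
class TerminalRegistry:
    """Hypothetical in-memory stand-in for the background record table."""
    records: dict = field(default_factory=dict)  # (account, mac, ip) -> user ID
    next_id: int = 1

    def lookup_or_register(self, account: str, mac: str, ip: str) -> int:
        """Return the existing user ID, or allocate the next one for a new terminal."""
        key = (account, mac, ip)
        if key not in self.records:
            self.records[key] = self.next_id
            self.next_id += 1  # IDs grow by 1 for each new record
        return self.records[key]


def build_payload(user_id: int, timestamp: int,
                  id_bits: int = 16, time_bits: int = 32) -> list:
    """Concatenate user ID and time into a bit list, most significant bit first.

    The field widths are assumptions for illustration only.
    """
    if not 0 <= user_id < (1 << id_bits):
        raise ValueError("user_id does not fit in id_bits")
    if not 0 <= timestamp < (1 << time_bits):
        raise ValueError("timestamp does not fit in time_bits")
    text = f"{user_id:0{id_bits}b}{timestamp:0{time_bits}b}"
    return [int(ch) for ch in text]
```

The resulting bit list is the input to the error correction coding step; a terminal that reconnects with the same account, MAC address and IP address receives the same ID.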

Preferably, the error correction coding processing refers to that, for the watermark information bit string with the length Len, Len is a multiple of 8, the watermark information error correction coding processing is performed in a parity check manner to obtain complete watermark coding data, and the specific process is as follows: arranging watermark information bit strings into a matrix of Len/8 rows and 8 columns; the parity check code for each column of information is calculated to form the Len/8+1 th row, and the total length of the valid information code and the valid parity check code is Len + 8.

Preferably, the dynamically generating and loading the watermark font file in real time means opening a temporary watermark character data file, dynamically reading font outline curve data from the temporary watermark character data file according to a grouping strategy and user terminal watermark encoding data, and updating corresponding font structure data in a standard font installed in the system.

Preferably, the dynamically generating and loading the watermark word stock file in real time means that the loading of the watermark word stock is respectively completed according to the difference between Windows and Linux operating systems:

1) a Windows environment.

Firstly, the system API function RemoveFontResource(LPCTSTR lpFileName) is called to remove the standard font library installed by the system from the system font table, and a WM_FONTCHANGE message is sent to all top-level windows in the system to notify them of the change; then, the AddFontResource(LPCTSTR lpszFilename) function is called to add the dynamically generated watermark font file resource to the system font table for normal use by Windows applications.

2) A Linux environment.

The standard fonts installed by the system are uninstalled by deleting the font files from the corresponding folders. After a global font is deleted, the command fc-cache -fv is issued to update the font cache of the system. The dynamically generated watermark font library file is then copied into the corresponding font directory, and fc-cache -fv is issued again so that the system registers the change. Once fc-cache completes, all users on the system can access the newly added global font.
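The two loading procedures can be sketched as follows. This is an illustrative outline only: the function names, the injectable `gdi32`, `user32` and `run` parameters (used so the logic can be exercised without touching a real system), and the choice to install the watermark file under the standard font's file name are assumptions, not details from the source. The second WM_FONTCHANGE broadcast after adding the font is also an addition of this sketch.

```python
import ctypes
import shutil
import subprocess
from pathlib import Path

HWND_BROADCAST = 0xFFFF   # send to all top-level windows
WM_FONTCHANGE = 0x001D    # font resource pool has changed


def swap_font_windows(standard_font: str, watermark_font: str,
                      gdi32=None, user32=None) -> None:
    """Remove the standard font, broadcast the change, then add the watermark font."""
    if gdi32 is None:
        gdi32 = ctypes.windll.gdi32    # only available on Windows
    if user32 is None:
        user32 = ctypes.windll.user32
    gdi32.RemoveFontResourceW(standard_font)
    user32.SendMessageW(HWND_BROADCAST, WM_FONTCHANGE, 0, 0)
    if gdi32.AddFontResourceW(watermark_font) == 0:
        raise OSError(f"could not add font resource: {watermark_font}")
    # Assumption: notify again so running applications pick up the new font.
    user32.SendMessageW(HWND_BROADCAST, WM_FONTCHANGE, 0, 0)


def swap_font_linux(standard_font: Path, watermark_font: Path,
                    font_dir: Path, run=subprocess.run) -> None:
    """Delete the standard font, refresh the cache, install the watermark font."""
    standard_font.unlink(missing_ok=True)
    run(["fc-cache", "-fv"], check=True)
    # Assumption: the watermark file takes over the standard font's file name.
    shutil.copy(watermark_font, font_dir / standard_font.name)
    run(["fc-cache", "-fv"], check=True)
```

Both functions need sufficient privileges to modify system fonts, and a production version would keep a backup of the standard font so that it can be restored.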

Preferably, the extracting of the watermark information mainly includes three steps:

1) according to the distribution of character groups in the document content, respectively extracting watermark information bit strings represented by all characters in each group;

2) checking and correcting the preliminarily extracted watermark information bit string by using a watermark information bit string parity check rule;

3) finally, splicing all watermark information bit strings extracted from the groups to obtain a final complete watermark information bit string.

Based on the same inventive concept, the invention also provides a universal text watermarking device, which comprises:

a character grouping module: the system is responsible for grouping a certain number of characters in the selected word stock according to a specific strategy;

a font design module: the system is responsible for performing deformation design on all characters in each group processed by the character grouping module according to a specific rule and generating a watermark character data temporary file;

watermark coding generation module: the system is responsible for generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal;

the word stock generation and loading module comprises: according to the watermark encoding data generated by the watermark encoding generation module, combining the temporary watermark character data file generated by the character design module and the grouping character obtained by the character grouping module, and taking charge of dynamically generating and loading the watermark font file in real time;

a watermark embedding module: the system is responsible for operating a text file in an electronic format, and embedding watermark information in real time in the document content data of file printout and screen display by using the watermark word stock file generated and loaded by the word stock generating and loading module;

a watermark extraction module: and the system is responsible for acquiring the document picture data which is obtained after the processing of the watermark embedding module and is hidden with the watermark information, extracting the watermark information and further tracing the document leakage source.

The invention has the following beneficial effects:

according to the invention, according to the unique watermark coding data information of the user terminal, the personalized watermark word stock is dynamically generated at the client, and the unique ID identification information is embedded in each watermark word stock. When the user identity information of different clients is different, the watermark information embedded in the watermark font library is also different. After the watermark font library is loaded in real time, all application software systems calling the local font library to perform printing output and screen display embed watermark information in the file content in real time. Therefore, the text watermarking method has the advantages of strong universality, good compatibility, high stability and simple watermark information embedding process. In addition, due to the adoption of a packet unordered embedding strategy, the text watermarking method has higher robustness for resisting malicious attacks such as cutting, kneading, fouling, tearing and the like.

Drawings

FIG. 1 is a schematic flow chart of an implementation of the universal text watermark embedding and extracting method described in the embodiment;

FIG. 2 is a schematic diagram of the arch grouping method;

FIG. 3 is a diagram of the temporary file structure of watermark character data;

FIG. 4 is a schematic diagram of error correction encoding of watermark information in the method according to the embodiment;

FIG. 5 is a diagram illustrating the process of dynamically generating a watermark font library;

FIG. 6 is a schematic structural diagram of the device for embedding and extracting a universal text watermark in the embodiment.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.

Fig. 1 is a schematic flow chart of an implementation of a general text watermark embedding and extracting method in the embodiment.

S101, grouping a certain number of characters in the selected word stock according to a specific strategy.

In the invention, in order to realize universal watermark information embedding, a unique personalized watermark word stock is dynamically generated at each client, namely, the watermark word stocks generated by different clients are different (which is different from any existing technical scheme, namely, the watermark word stocks generated and installed by each client in the existing technical scheme are the same), and corresponding user terminal identity authentication information is embedded in each watermark word stock. After the watermark word stock is generated, the watermark word stock is dynamically loaded in real time and replaces a standard word stock which is pre-installed in an operating system, and when all application software systems calling watermark fonts of a client side perform screen display and printout operations of an electronic document, watermark information is automatically embedded into the document content in real time. Therefore, the watermarking method has strong universality, simple implementation process and good compatibility with a system and other application software. But in order to ensure sufficient information capacity and watermark extraction efficiency, we represent a specific watermark bit string by a group of characters, i.e. all characters in the group represent the same watermark information bit string. When at least one character of each packet is present in the text content, the corresponding bit string of watermark information can be correctly extracted. Therefore, a certain number of characters in the selected word stock are reasonably grouped according to a specific strategy, so that the probability that the characters in each group appear in the text content is relatively high.

The specific grouping method comprises the following steps:

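First, common characters are sorted from high to low according to word frequency statistical results, and the top N characters are taken to form the watermark character set. In this embodiment, N is 2000.

Next, the N characters are preliminarily divided into M groups {Ω₁, Ω₂, …, Ω_M}, where M < N and, for any 0 < i, j ≤ M with i ≠ j, Ω_i ∩ Ω_j = ∅.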

In the present embodiment, M is 28.

The specific grouping process is as follows:

Step 1. The first M characters are selected in word-frequency order and assigned in sequence to {Ω₁, Ω₂, …, Ω_M}, one character being added to each group.

Step 2. The next M characters are selected and assigned in reverse order to {Ω_M, Ω_M-1, …, Ω₁}, one character being added to each group in turn.

Step 3. Step 1 and Step 2 are repeated until all N characters are grouped.
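The alternating forward and reverse assignment can be written compactly. The following is a minimal sketch; the function name `arch_grouping` and the list-of-lists representation are illustrative choices, and the input is assumed to be already sorted from highest to lowest word frequency.

```python
def arch_grouping(chars, m):
    """Distribute frequency-sorted characters over m groups in an arch pattern.

    Blocks of m characters are assigned alternately in forward order
    (group 1 .. group m) and in reverse order (group m .. group 1).
    """
    groups = [[] for _ in range(m)]
    for start in range(0, len(chars), m):
        block = chars[start:start + m]
        forward = (start // m) % 2 == 0
        order = range(m) if forward else range(m - 1, -1, -1)
        # zip stops at the shorter sequence, so a final partial block is handled.
        for ch, g in zip(block, order):
            groups[g].append(ch)
    return groups
```

With N = 2000 and M = 28 this yields groups of 71 or 72 characters, and each group receives a mix of high-frequency and low-frequency characters, which is the purpose of the reverse pass.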

The M groups {Ω₁, Ω₂, …, Ω_M} are preliminarily obtained by the arch grouping method shown in FIG. 2. For example, the character set in the first group is:

Then, a certain number of text corpora are randomly selected, and the number of characters in each text corpus is fixed to t; in this embodiment, t is 200. To verify the probability with which each group of characters appears in common text documents, a large number of samples must be collected for training tests. Therefore, nearly 50 million articles were downloaded by means of internet crawlers, covering fields such as politics, military affairs, news, sports, culture, history and finance. After content filtering and clipping, each collected article is stored as a text training corpus of 200 characters.

Because the grouping operation is performed only based on the word frequency sorting result in the method, the situation of uneven probability distribution may occur in the actual text corpus training process, and therefore, a specific optimization operation is required to obtain more balanced grouping. The specific grouping optimization method comprises the following steps:

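Step 1, calculating the probability that the characters in each group Ω_i, 1 ≤ i ≤ M, appear in the training corpora, and sorting all the probabilities from large to small;

Step 2, extracting the character with the minimum word frequency from the group with the highest probability and moving it into the group with the lowest probability, extracting the character with the minimum word frequency from the group with the next highest probability and moving it into the group with the next lowest probability, and repeating the process in sequence until all groups have been processed;

Step 3, repeating Step 1 and Step 2 until the probability variance of all the groups reaches the minimum, thereby obtaining the optimal grouping {Ω′₁, Ω′₂, …, Ω′_M}.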
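A rough sketch of this optimization loop is given below. Several points are assumptions rather than details from the source: the probability of a group is taken as the fraction of training corpora that contain at least one character of the group, the loop stops at the first round that fails to reduce the variance, and `rank` is a hypothetical mapping from each character to its word-frequency rank (0 for the most frequent).

```python
from statistics import pvariance


def group_probabilities(groups, corpora):
    """Fraction of corpora containing at least one character of each group."""
    corpus_sets = [set(text) for text in corpora]
    probs = []
    for group in groups:
        members = set(group)
        hits = sum(1 for chars in corpus_sets if chars & members)
        probs.append(hits / len(corpus_sets))
    return probs


def rebalance_once(groups, probs, rank):
    """Move the rarest character from each high-probability group to its low-probability partner."""
    order = sorted(range(len(groups)), key=lambda i: probs[i], reverse=True)
    new_groups = [list(g) for g in groups]
    for k in range(len(order) // 2):
        high, low = order[k], order[-1 - k]
        if len(new_groups[high]) > 1:  # never empty a group
            rarest = max(new_groups[high], key=lambda c: rank[c])
            new_groups[high].remove(rarest)
            new_groups[low].append(rarest)
    return new_groups


def optimize_groups(groups, corpora, rank, max_rounds=100):
    """Repeat the rebalancing while the variance of group probabilities keeps falling."""
    best = [list(g) for g in groups]
    best_var = pvariance(group_probabilities(best, corpora))
    for _ in range(max_rounds):
        candidate = rebalance_once(best, group_probabilities(best, corpora), rank)
        var = pvariance(group_probabilities(candidate, corpora))
        if var >= best_var:
            break  # assumed stopping rule: no further improvement
        best, best_var = candidate, var
    return best
```

The greedy stopping rule finds a local minimum of the variance only; the source states the goal (minimum variance) but not how convergence is detected.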

S102, performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file.

The character deformation design refers to adjusting the vector outline curve data of a character to obtain d different deformations, where d ≥ 2, and different outline curve deformations represent different watermark information bit strings. To effectively increase the watermark information capacity, the number of character deformations usually exceeds 2; in this embodiment, d is 4. That is, each character has 4 different deformations, each representing a 2-bit watermark information bit string. In addition, it should be noted that all characters in each group represent the same watermark information bit string.

And storing the character deformation contour curve data generated by the design in a temporary file together with character attribute description information, wherein the character attribute description information comprises the width of a font frame, the height of the font frame and the offset of each font in the font frame, and the offset can be changed along with the difference of font structures. In the temporary file, the storage structure of each font datum is as shown in fig. 3. The specific information is described as follows:

UNICODE encoding of characters: assigning a unique UNICODE code to the character represented by the glyph structure in the temporary file;

horizontal layout: includes the width of the font outer frame and the distance from the leftmost point of the font contour line to the font left frame;

vertical layout: the height of the font outer frame and the distance from the topmost point of the font contour line to the font upper frame are included;

size of primitive data: the capacity of the vector outline curve data of the character pattern structure is represented, and the unit is byte;

primitive data: an array of BYTE types stores vector outline curve data of a specific glyph structure, and also includes the definition of a grid and associated instruction data.
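One possible serialization of such a record is sketched below. The source lists the fields but not their binary widths or byte order, so the little-endian layout (`<IHhHhI`), the class name `GlyphRecord` and its field names are all assumptions made for illustration.

```python
import struct
from dataclasses import dataclass

# Assumed layout: code point (uint32), frame width (uint16), left offset (int16),
# frame height (uint16), top offset (int16), primitive data size in bytes (uint32).
HEADER = struct.Struct("<IHhHhI")


@dataclass
class GlyphRecord:
    codepoint: int      # UNICODE encoding of the character
    frame_width: int    # horizontal layout: width of the outer frame
    left_offset: int    # horizontal layout: contour to left frame distance
    frame_height: int   # vertical layout: height of the outer frame
    top_offset: int     # vertical layout: contour to upper frame distance
    outline: bytes      # primitive data: vector outline curve bytes

    def pack(self) -> bytes:
        """Serialize the header followed by the raw outline bytes."""
        header = HEADER.pack(self.codepoint, self.frame_width, self.left_offset,
                             self.frame_height, self.top_offset, len(self.outline))
        return header + self.outline

    @classmethod
    def unpack_from(cls, buf: bytes, offset: int = 0):
        """Read one record; return it together with the offset of the next record."""
        cp, width, left, height, top, size = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        outline = bytes(buf[start:start + size])
        return cls(cp, width, left, height, top, outline), start + size
```

Because each record carries its own data size, records for the d deformations of every character can be stored back to back and read sequentially.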

S103, generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal.

In order to effectively track and trace a client-side divulgence event, unique watermark coding data information needs to be generated for a user terminal, the content includes identity authentication information and time information of the user terminal, and a specific generation method includes a manual designation mode and an automatic distribution mode. The manual designation mode refers to that when the client system is installed and deployed, unique encoding information is manually designated for the client. The method for automatically distributing the watermark coding data characteristics comprises the following steps:

Step 1, creating a user terminal identity identification information record table in the system background, wherein the information content comprises a user ID, a user login account, a machine MAC address and a machine IP address; the user ID is automatically allocated and added by the background system, and the remaining information is automatically submitted by a client monitoring program;

Step 2, running the client monitoring program, which automatically acquires the identity identification information of the user terminal and uploads it to the system background; when the uploaded identification information already exists in the record table of the system background database, the user ID information is returned directly; otherwise, a new record is added to the system background database, the user ID is incremented by 1, and the user ID is returned to the client;

Step 3, after receiving the user ID information returned by the system background, the client monitoring process reads the system running time in real time, and performs error correction coding on the user ID information and the time information to obtain the final user terminal watermark encoding data.

For a watermark information bit string of length Len, where Len is a multiple of 8, error correction coding is performed on the watermark information in a parity check manner to obtain the complete watermark encoding data. The specific process is as follows: the watermark information bit string is arranged into a matrix of Len/8 rows and 8 columns; the parity check code of each column is calculated to form row Len/8+1, so that the total length of the valid information code and the parity check code is Len + 8. In this embodiment, Len is 48 and the watermark information bit string is arranged as shown in FIG. 4: the first 6 rows are the valid watermark information bit string, and each entry of the last row (the gray part in the figure) is the parity of the first 6 rows in its column, so that the final watermark encoding data amounts to 56 bits. Since 28 groups are selected in this embodiment, each representing 2 bits of watermark information, exactly 56 bits of watermark encoding data can be embedded.
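The column parity scheme can be sketched as follows. Even parity is assumed (the source does not say whether parity is even or odd), and the helper names are illustrative. The last function shows how the coded bits could be split into one symbol per group under the embodiment's values of 2 bits per group.

```python
def add_column_parity(bits):
    """Append one parity row: bits are laid out in rows of 8, parity taken per column."""
    if len(bits) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    rows = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    parity = [sum(column) % 2 for column in zip(*rows)]  # even parity (assumed)
    return list(bits) + parity


def failing_columns(coded):
    """Return the column indices whose parity check fails."""
    rows = [coded[i:i + 8] for i in range(0, len(coded), 8)]
    return [c for c, column in enumerate(zip(*rows)) if sum(column) % 2 != 0]


def to_symbols(coded, bits_per_symbol=2):
    """Split the coded bit string into per-group symbols (index of the deformation)."""
    symbols = []
    for i in range(0, len(coded), bits_per_symbol):
        value = 0
        for bit in coded[i:i + bits_per_symbol]:
            value = (value << 1) | bit
        symbols.append(value)
    return symbols
```

For Len = 48 this gives 56 coded bits and 28 symbols in the range 0 to 3. A single parity row identifies which column contains an odd number of errors but not which row, so locating the erroneous bit needs additional information, for example knowledge of which groups were missing or unreliable during extraction.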

S104, dynamically generating and loading the watermark font file in real time according to the watermark coding data.

The process of dynamically generating the watermark font library refers to opening the temporary file of watermark character data, dynamically reading font outline curve data from the temporary file according to the grouping strategy and the watermark coding information, and updating the corresponding font structure data in the standard font library installed in the system. The specific process is shown in FIG. 5.

Step 1, first, the key data of the standard word stock file installed in the system is parsed, including the font information header, the maximum requirement table, the naming table, the font coding mapping table, the primitive position index table, the horizontal layout degree table and the vertical layout degree table of the font, the primitive data and the like;

step2, reading a temporary file containing watermark character outline curve data;

step3, initializing an empty font coding mapping table M, a primitive position index table S, a horizontal layout degree table H and a vertical layout degree table V, and generating an empty primitive data file tmp for storing the primitive data information of the vector;

Step 4, writing the data of each character in turn. For the i-th character currently being processed, let the total amount of primitive data of all i-1 previously processed characters be dwS_i-1; then the value of the (i-1)-th entry in the primitive position index table is S_i-1 = dwS_i-1. The i-th character is then processed as follows:

1) updating the font code mapping table M according to the UNICODE code represented by the character in the font library;

2) reading from the temporary file the horizontal and vertical layout information of the character, the data volume dw_N of the new primitive and the vector outline data of the new primitive, and updating them into the corresponding tables of the target watermark word stock;

3) updating the primitive position index table of the target watermark font file as follows: S_i = S_i-1 + dw_N;

4) saving the primitive data read from the temporary file into the primitive data file tmp.

Step 5, writing the font file header information and the related attribute value information, the font coding mapping table, the primitive position index table and the horizontal and vertical layout degree tables in sequence according to the structure of the font file, and finally writing all the primitive data stored in the primitive data file tmp into the primitive data area of the newly generated font library to generate a new watermark font library file.
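The bookkeeping of Step 4 (mapping table, position index table and primitive data area) can be sketched as below. This is a simplified illustration, not a font writer: it covers only the watermark characters, omits the layout tables and all header data, and the `variants` dictionary keyed by (character, deformation index) is a hypothetical in-memory form of the temporary file.

```python
def assemble_glyph_tables(groups, symbols, variants):
    """Build the code mapping, position index table and primitive data area.

    groups:   list of character groups
    symbols:  one deformation index per group, taken from the watermark coding data
    variants: {(char, deformation_index): outline_bytes} (hypothetical layout)
    """
    code_map = {}        # UNICODE code point -> primitive index
    index_table = [0]    # S_0 = 0; S_i = S_(i-1) + dw_N
    data = bytearray()   # plays the role of the primitive data file tmp
    for group, symbol in zip(groups, symbols):
        for ch in group:
            outline = variants[(ch, symbol)]
            code_map[ord(ch)] = len(index_table) - 1
            data += outline
            index_table.append(index_table[-1] + len(outline))
    return code_map, index_table, bytes(data)
```

The outline of primitive k is then `data[index_table[k]:index_table[k + 1]]`, which is why the index table carries one more entry than there are primitives.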

As described above, after the watermark font library is generated according to the watermark encoding information, the font resource table of the operating system needs to be updated to complete the correct call of the application software system to the watermark font library. According to different operating systems, the loading process of the watermark word stock is divided into the following two cases:

1) Windows environment.

First, the system API function RemoveFontResource(PCTSTR lpFileName) is called to remove the standard font library installed by the system from the system font table, and a WM_FONTCHANGE message is sent to all top-level windows in the system to notify them of the change; then the AddFontResource(LPCTSTR lpszFilename) function is called to add the dynamically generated watermark font file resource to the system font table for normal use by Windows applications.
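The Windows call sequence above can be sketched through Python's ctypes bindings to the Win32 API. This is an illustrative sketch only: the function name `swap_system_font` and its parameters are hypothetical, the wide-character variants of the two API functions are used, and on any platform other than Windows the function does nothing and returns False.

```python
import sys

WM_FONTCHANGE = 0x001D   # Win32 message announcing a change to the font table
HWND_BROADCAST = 0xFFFF  # pseudo-handle addressing all top-level windows


def swap_system_font(standard_font_path: str, watermark_font_path: str) -> bool:
    """Replace a standard font resource with the watermark font resource."""
    if sys.platform != "win32":
        return False  # the Win32 font API is only available on Windows
    import ctypes
    gdi32 = ctypes.windll.gdi32
    user32 = ctypes.windll.user32
    # Remove the standard font library from the system font table.
    gdi32.RemoveFontResourceW(standard_font_path)
    # Notify all top-level windows of the change.
    user32.SendMessageW(HWND_BROADCAST, WM_FONTCHANGE, 0, 0)
    # Add the dynamically generated watermark font file.
    added = gdi32.AddFontResourceW(watermark_font_path)
    return added > 0  # AddFontResource returns the number of fonts added
```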

2) A Linux environment.

S105, running the text file in the electronic format, and embedding watermark information in real time in the document content data that the file prints out and displays on the screen.

As described above, in the present invention, the watermark font file is dynamically generated and updated to the font table of the operating system to replace the standard font file installed by the operating system, so that the application program of the system automatically calls the newly loaded watermark font file, thereby completing the real-time embedding of the watermark information.

S106, obtaining the document picture data with the hidden watermark information, extracting the watermark information, and tracing the document divulgence source.

The watermark information extraction process mainly comprises three steps:

3) finally, splicing all the information strings extracted from the groups to obtain the final complete watermark information bit string.

It should be noted that when the same character appears at different positions in the document content, it represents the same watermark information bit string, and all characters that appear in the document content and belong to the same group also represent the same watermark information. For processing efficiency, a threshold p is defined for the number of watermark character extractions in each group, that is, watermark information is extracted at most p times per group. When the number of occurrences of the same character in a group exceeds p, or the number of characters of the group that appear exceeds p, the watermark extraction process is run only p times. Otherwise, the watermark extraction operation is performed for all characters that appear.
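The effect of the threshold p can be sketched as a sampling step that runs before the actual extraction. The function name `select_samples`, the `group_of` mapping from character to group number, and the choice of the first p occurrences are assumptions made for this example; the text above does not state which p occurrences are used.

```python
from collections import defaultdict


def select_samples(text, group_of, p):
    """Return, per group, the positions of at most p characters to extract from."""
    samples = defaultdict(list)
    for pos, ch in enumerate(text):
        group = group_of.get(ch)
        if group is None:
            continue  # character carries no watermark information
        if len(samples[group]) < p:
            samples[group].append(pos)  # keep only the first p occurrences
    return dict(samples)
```

For the text "aabab c" with 'a' in group 0, 'b' in group 1 and p = 2, the third occurrence of 'a' is skipped and each group contributes exactly two positions.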

As shown in fig. 6, based on the same inventive concept, the present invention further provides a general text watermark embedding and extracting apparatus, including:

character grouping module 1: responsible for grouping a certain number of characters in the selected word stock according to a specific strategy;

font design module 2: responsible for performing deformation design on all characters in each group produced by the character grouping module according to a specific rule, and generating a watermark character data temporary file;

watermark coding generation module 3: responsible for generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal;

word stock generation and loading module 4: responsible for dynamically generating and loading the watermark font file in real time according to the watermark encoding data generated by the watermark coding generation module, in combination with the watermark character data temporary file generated by the font design module and the grouped characters obtained by the character grouping module;

watermark embedding module 5: responsible for running a text file in an electronic format and, when the file is printed out or its content is displayed on a screen, embedding watermark information in real time by using the watermark font file generated and loaded by the word stock generation and loading module;

watermark extraction module 6: responsible for acquiring the document picture data in which watermark information is hidden, obtained after processing by the watermark embedding module, extracting the watermark information, and thereby tracing the document leakage source.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

For example, in order to prevent the dynamically generated watermark font library from being maliciously tampered with, or from being damaged when a user reinstalls a related font, a watermark font library detection service program is deployed and installed on the client. The detection service program calculates the SHA1 value of the newly generated watermark font library file and records the value in a system ledger. The detection service program scans the loaded watermark font library file at regular intervals, calculates the SHA1 value of the current watermark font library file, and compares it with the value recorded in the system ledger. If the two differ, the watermark font file has been corrupted, and the watermark font generation and loading processes are executed again.
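The integrity check performed by the detection service can be sketched with Python's standard hashlib module. The function names and the idea of passing the ledger value in as a hex string are assumptions made for this example; how the system ledger is stored is not specified here.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Compute the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def font_is_intact(path, recorded_sha1):
    """Compare the current SHA1 value with the value recorded in the ledger."""
    return sha1_of_file(path) == recorded_sha1
```

When `font_is_intact` returns False, the service would trigger the generation and loading processes again.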

For example, the generated watermark encoding data of the user terminal includes the identity authentication information and the time information of the user terminal. In order to accurately record different time information every day, a watermark font library generation time information monitoring program can be deployed and installed on the client side. When the operating system is restarted, the monitoring program executes the dynamic generation and real-time loading work of the watermark font library and records the effective time of the current watermark font library. And during the normal operation of the operating system, the monitoring program regularly detects the effective time of the watermark font library. And if the current time is not in the same day as the effective time, the monitoring program re-executes the dynamic generation and real-time loading work, and updates the effective time of the watermark font library again.
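The monitoring logic described above can be sketched as follows. The class name `WatermarkFontMonitor`, the `regenerate` callback standing in for the dynamic generation and real-time loading work, and the two entry points are hypothetical names chosen for this example.

```python
from datetime import datetime


class WatermarkFontMonitor:
    def __init__(self, regenerate):
        self.regenerate = regenerate   # hypothetical generate-and-load callback
        self.effective_time = None     # effective time of the current font library

    def on_boot(self, now: datetime) -> None:
        """Run when the operating system restarts."""
        self.regenerate(now)
        self.effective_time = now

    def on_timer(self, now: datetime) -> bool:
        """Run periodically; regenerate if the day has changed."""
        if self.effective_time is None or self.effective_time.date() != now.date():
            self.regenerate(now)
            self.effective_time = now
            return True
        return False
```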

For example, in order to improve the efficiency of dynamically generating the watermark word stock of the user terminal, a watermark word stock generation time information monitoring program is deployed and installed on the client, and the corresponding watermark word stocks are generated in advance for different time periods. When the operating system is restarted, the monitoring program selects the watermark font file corresponding to the current time period of the system, copies and loads that watermark font file in real time, and then deletes the expired watermark font files that have already been used.

For example, in the character grouping optimization process, the grouping strategy can be further optimized by splitting words. Suppose the two characters that make up a high-frequency word (here, the word meaning "aim") both fall in the first group; in normal text content the probability that these two characters appear together is relatively high. Therefore, the character of the pair with the relatively lower frequency can be moved to a group with lower probability in the current optimization stage, so that a more probability-balanced grouping result is obtained.

For example, in the method for generating a watermark font library according to embodiment 1, only one contour curve deformation of each character is updated into the standard character encoding area, so that system application software can be used normally without problems such as garbled display. If instead the contour curve data of all deformations of each character are copied into the extended coding area of the word stock, and a unique UNICODE code is allocated to each deformation, more watermark information can be embedded in text content with fewer characters by dynamically replacing character codes during document printout or outgoing-document management, thereby improving the watermark information capacity.

Claims

1. A universal text watermarking method, comprising the steps of:

grouping a certain number of characters in the selected word stock according to a specific strategy;

performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file;

generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal;

dynamically generating and loading a watermark font file in real time according to the watermark encoding data and by combining the watermark character data temporary file and the grouped characters;

and running the text file in an electronic format, and embedding watermark information in the document content data printed out of the file and displayed on a screen in real time by using the watermark font file.

2. The method of claim 1, further comprising the steps of:

and acquiring document picture data hidden with watermark information, extracting the watermark information, and tracing a document leakage source.

3. The method of claim 1 or 2, wherein grouping a number of characters in the selected word stock comprises:

Next, the N characters are initially divided into M groups, denoted as {Ω₁, Ω₂, …, Ω_M}, in which M < N, and make arbitrary

The specific grouping process comprises the following steps:

are sequentially divided into the groups {Ω₁, Ω₂, …, Ω_M}, adding one character to each group;

step2, selecting M characters again

step3, repeatedly executing step1 and step2 until the grouping of the N characters is completed;

finally, the probabilities with which the characters of all the groups {Ω₁, Ω₂, …, Ω_M} occur in the text training corpus t are obtained by statistics, and the grouping result is optimized according to the probability distribution to obtain the final grouping {Ω′₁, Ω′₂, …, Ω′_M}.
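The initial grouping loop and the per-group statistics can be sketched as below. Several points are assumptions made for this example: the characters are taken to arrive already in the order in which they are selected (the selection criterion is not given in this passage), the group probability is normalized over the grouped characters only, and the function names are hypothetical. The optimization that produces the final grouping is not shown.

```python
from collections import Counter


def initial_grouping(chars, m):
    """Deal characters into m groups, m at a time, one character per group."""
    groups = [[] for _ in range(m)]
    for start in range(0, len(chars), m):
        batch = chars[start:start + m]      # the next M selected characters
        for group, ch in zip(groups, batch):
            group.append(ch)                # add one character to each group
    return groups


def group_probabilities(groups, corpus):
    """Share of grouped-character occurrences in the corpus that fall in each group."""
    counts = Counter(corpus)
    total = sum(counts[ch] for group in groups for ch in group)
    if total == 0:
        return [0.0] * len(groups)
    return [sum(counts[ch] for ch in group) / total for group in groups]
```

A strongly uneven result from `group_probabilities` indicates that characters should be moved between groups in the optimization stage.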

4. The method according to claim 1 or 2, wherein the optimizing the grouping result according to the probability distribution to obtain the final grouping comprises:

5. The method according to claim 1 or 2, wherein the performing deformation design on all the characters in each group means that vector contour curve data of the characters are adjusted to obtain d different deformations, wherein d ≥ 2, and the different contour curve deformations represent different watermark information bit strings; all characters in each group represent the same watermark information bit string; the generation of the temporary file of the watermark character data means that the character deformation contour curve data generated by the design and the character attribute description information are stored together in the temporary file, wherein the character attribute description information comprises the width of the font frame, the height of the font frame and the offset of each font in the font frame, and the offset varies with the font structure.

6. The method according to claim 1 or 2, characterized in that the user terminal watermark encoding data comprises the identity authentication information and the time information of the user terminal, and the user terminal watermark encoding data is generated by adopting a manual designation or automatic distribution mode; the automatic allocation mode comprises the following steps:

step2, operating a client monitoring program, automatically acquiring identity identification information of the user terminal and uploading the identity identification information to a system background, and directly returning user ID information when the uploaded identification information exists in a record table of a database of the system background, or else, adding a new record in the database of the system background, adding 1 to the user ID and returning the user ID to the client;

7. The method according to claim 6, wherein the error correction coding process is performed on the watermark information bit string with the length Len, Len is a multiple of 8, and the error correction coding process is performed in a parity check manner to obtain the complete watermark coding data, and the specific process is as follows: arranging watermark information bit strings into a matrix of Len/8 rows and 8 columns; the parity check code for each column of information is calculated to form the Len/8+1 th row, and the total length of the valid information code and the valid parity check code is Len + 8.
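The column parity arrangement can be illustrated with a short sketch. Even parity is assumed here because the parity sense is not stated, and the function names are hypothetical. The bit string is laid out as Len/8 rows of 8 columns, one parity bit is computed per column, and the 8 parity bits form the additional row, giving Len + 8 bits in total.

```python
def add_column_parity(bits):
    """Append one even-parity bit per column of the Len/8 x 8 bit matrix."""
    if len(bits) % 8 != 0:
        raise ValueError("Len must be a multiple of 8")
    rows = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    parity = [sum(row[c] for row in rows) % 2 for c in range(8)]
    return bits + parity  # total length is Len + 8


def columns_with_errors(coded):
    """Return the column indices whose parity check fails."""
    rows = [coded[i:i + 8] for i in range(0, len(coded), 8)]
    return [c for c in range(8) if sum(row[c] for row in rows) % 2 != 0]
```

For the 16-bit string 10110010 01100110 the parity row is 11010100; flipping any single bit of the coded string makes exactly one column fail the check.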

8. The method according to claim 1 or 2, wherein the dynamically generating and loading the watermark font file in real time means opening a temporary watermark character data file, dynamically reading font outline curve data from the temporary watermark character data file according to a grouping strategy and user terminal watermark encoding data, and updating corresponding font structure data in a system-installed standard font; the real-time loading of the watermark word stock file refers to the fact that the watermark word stock is loaded respectively according to different Windows and Linux operating systems.

9. The method of claim 2, wherein the extracting watermark information comprises:

according to the distribution of character groups in the document content, respectively extracting watermark information bit strings represented by all characters in each group;

checking and correcting the preliminarily extracted watermark information bit string by using a watermark information bit string parity check rule;

finally, splicing all watermark information bit strings extracted from the groups to obtain a final complete watermark information bit string.

10. A universal text watermarking apparatus using the method of any one of claims 1 to 9, comprising:

the character grouping module is responsible for grouping a certain number of characters in the selected word stock according to a specific strategy;

the character pattern design module is responsible for performing deformation design on all characters in each group processed by the character grouping module according to a specific rule and generating a watermark character data temporary file;

the watermark coding generation module is responsible for generating watermark coding data of the user terminal and is used for identifying the identity authentication information of the user terminal;

the font library generating and loading module is responsible for dynamically generating and loading the watermark font library file in real time according to the watermark encoding data generated by the watermark encoding generating module, and by combining the watermark character data temporary file generated by the font design module and the grouped characters obtained by the character grouping module;

the watermark embedding module is responsible for operating the text file in the electronic format, and embedding watermark information in real time by using the watermark word stock file generated and loaded by the word stock generating and loading module when the file is printed out or the content of the file is displayed on a screen;

and the watermark extraction module is in charge of acquiring the document picture data which is obtained by processing of the watermark embedding module and is hidden with the watermark information, extracting the watermark information and further tracing a document divulgence source.