CN114708133B

CN114708133B - Universal text watermarking method and device

Info

Publication number: CN114708133B
Application number: CN202210100368.XA
Authority: CN
Inventors: 李公宝; 丛升日
Original assignee: Beijing Guoyin Technology Co ltd
Current assignee: Beijing Guoyin Technology Co ltd
Priority date: 2022-01-27
Filing date: 2022-01-27
Publication date: 2022-11-15
Anticipated expiration: 2042-01-27
Also published as: CN114708133A

Abstract

The invention relates to a universal text watermarking method and a universal text watermarking device. The method comprises the following steps: grouping a certain number of characters in the selected word stock according to a specific strategy; performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file; generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal; dynamically generating and loading a watermark font file in real time according to the watermark encoding data and by combining the watermark character data temporary file and the grouped characters; running a text file in an electronic format, and embedding watermark information in document content data printed out of the file and displayed on a screen in real time by using a watermark font file; and acquiring document picture data with hidden watermark information, extracting the watermark information, and tracing a document divulgence source. The text watermarking method has the advantages of strong universality, good compatibility, high stability and simple watermark information embedding process.

Description

Universal text watermarking method and device

Technical Field

The invention belongs to the technical field of document protection and image processing, relates to a method and a device for embedding and extracting a digital watermark, and particularly relates to a universal text watermarking method and a universal text watermarking device.

Background

With the development of e-commerce and e-government, enterprises, public institutions, party and government bodies, national security agencies, and other departments process large volumes of text materials, including contracts and important confidential documents. Copyright protection and content security for these text files is an important research topic, and digital watermarking technology offers a way to address it. In addition, many text documents exist not only in digital form but also on paper, through printing, copying, and the like. With the rapid development of digitization, this has become quite common, so important or confidential information is easily leaked through printed paper documents or through the on-screen display of electronic documents. It is therefore important to study text-based digital watermarking technology that can resist print-scan and screen-shooting attacks.

In existing text watermarking technology, text digital watermarking based on modification of the character topological structure has become the mainstream approach, because it improves the robustness of the watermarking algorithm against attacks such as printing and scanning, screen capture, and screen shooting. A given character is deformed in several different ways, each deformation corresponds to a different watermark information bit string, and the character deformation data can be stored in a dedicated watermark font library. While an electronic text document is printed or displayed on screen, watermark information is embedded through font replacement. Real-time loading of watermark information is therefore a key step of text watermarking based on font replacement. Currently, font library loading and watermark embedding are mainly implemented in the following ways:

1) Through HOOK technology. To replace font information in real time, the content data of the text file must be acquired in advance. Typically, a system hook intercepts a specific file operation, the intermediate-format file data is obtained, watermark embedding is completed by a font replacement operation, and finally the normal file operation is executed. For example, by hooking the print operation of an electronic text file, watermark information can be embedded in real time while the document is printed normally; alternatively, when a file-open operation is intercepted, the content data of the specific file format is parsed, the font type is replaced to embed the watermark information, and the watermarked electronic file is then opened.

2) Through a software plug-in mechanism. To replace the font library in content of a specific file format and embed watermark information, a corresponding software plug-in module can be developed to parse the file content. For example, based on the VBA (Visual Basic for Applications) macro or VSTO (Visual Studio Tools for Office) technology of Microsoft Office, watermark information can be embedded in common Office-format files such as doc, docx, Excel, or PPT. When an electronic file in such a format is opened or printed, the Office plug-in calls an interface provided by the software system to parse and modify the file content, so that watermark embedding is completed after the font library type is replaced.

However, the above methods have many problems and drawbacks: 1) Poor universality. For example, HOOK technology can embed watermark information well during file printing, but the screen display operation of an electronic file is difficult to intercept, so watermark information cannot be embedded in real time in the displayed content; the embedding method based on a software plug-in mechanism is tied to a specific software system, and since not all software systems provide secondary development interfaces, it is considerably limited. 2) The existing methods are complex to implement, and plug-in development for web browsers is particularly difficult. 3) HOOK technology has poor stability, many compatibility problems with system software, and lower security. 4) Current watermark embedding algorithms embed the watermark in each page of document data in a sequential, redundant manner, which requires the watermark information content to be locally complete. Such algorithms therefore perform poorly against attacks such as cropping, crumpling, staining, and tearing. In view of the above problems, the present invention provides a universal text watermarking solution.

Disclosure of Invention

The invention provides a method and a device for embedding and extracting a universal text watermark generated based on a dynamic word stock, which are used for solving the problems of poor watermark loading universality, poor system stability, complex implementation process, low watermark algorithm robustness and the like in the prior art on the premise of not changing any use habit of a user.

The idea of the invention is that first, a certain number of characters in a selected word stock are uniformly grouped according to a specific strategy, and all characters in each group represent the same watermark information bit string; performing deformation design on all characters in each group according to a specific rule, respectively obtaining a plurality of watermark character contour curve data corresponding to each character, and generating a watermark character data temporary file; generating watermark coding data of the user terminal according to a specific rule so as to identify the identity authentication information of the user terminal; according to the watermark coding data, dynamically generating a watermark font file through a watermark character data temporary file, wherein the watermark font file has the same attribute with a same-name font file installed in a system; loading the watermark font file in real time and replacing the same-name font file installed in the system; running a text file in an electronic format, and embedding watermark information in document content data of file printout and screen display in real time; and acquiring document picture data hidden with watermark information, extracting the watermark information, and tracing a document leakage source. Thus, a universal text watermark embedding and extracting method and device are obtained.

The invention discloses a universal text watermark method, which comprises a text watermark embedding and extracting method, wherein the text watermark embedding comprises the following steps:

step one, grouping a certain number of characters in a selected word stock according to a specific strategy;

step two, performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file;

step three, generating watermark encoding data of the user terminal to identify the identity authentication information of the user terminal;

step four, dynamically generating and loading a watermark font file in real time according to the watermark encoding data and combining the watermark character data temporary file and the grouped characters;

and step five, operating the text file in the electronic format, and embedding watermark information in the document content data printed out of the file and displayed on the screen in real time by using the watermark font file.

Further, the method also comprises a text watermark extraction step, namely a step six: and acquiring document picture data hidden with watermark information, extracting the watermark information, and tracing a document leakage source.

Preferably, the method for grouping a certain number of characters in the selected word stock comprises:

firstly, sorting common characters from high to low according to word frequency statistics, and taking the first N characters to form a watermark character set;

next, initially dividing the N characters into M groups, denoted {Ω_1, Ω_2, …, Ω_M}, where M < N and, for any 0 < i, j ≤ M with i ≠ j, Ω_i ∩ Ω_j = ∅.

The specific grouping process is as follows:

Step 1. Select M characters in character-frequency order and assign them in sequence to {Ω_1, Ω_2, …, Ω_M}, adding one character to each group;

Step 2. Select the next M characters and assign them in reverse order to {Ω_M, Ω_{M-1}, …, Ω_1}, adding one character to each group in turn;

Step 3. Repeat Step 1 and Step 2 until all N characters have been grouped.

Then, randomly selecting a certain number of text training corpora, and fixing the number of characters of each text training corpus at t;

finally, the probability with which the characters of each group in {Ω_1, Ω_2, …, Ω_M} appear in the text training corpora is counted, and the grouping result is optimized according to the probability distribution to obtain the final grouping {Ω′_1, Ω′_2, …, Ω′_M}.

Preferably, the method of optimizing the grouping result according to the probability distribution to obtain the final grouping includes:

Step 1. Calculate, for each group Ω_i (1 ≤ i ≤ M), the probability that its characters appear in the text training corpora, and sort all the probabilities from largest to smallest;

Step 2. Take the character with the lowest word frequency out of the group with the highest probability and move it to the group with the lowest probability, then do the same from the group with the second-highest probability to the group with the second-lowest probability, and so on until all moves are finished;

Step 3. Repeat Step 1 and Step 2 until the variance of the group probabilities reaches its minimum, thereby obtaining the final grouping {Ω′_1, Ω′_2, …, Ω′_M}.

Preferably, the performing of the deformation design on all the characters in each group means that the vector contour curve data of the characters is adjusted to obtain d different deformations, wherein d is greater than or equal to 2, and the different contour curve deformations represent different watermark information bit strings;

preferably, all characters in each group represent the same watermark information bit string.

Preferably, the generating of the temporary file of watermark character data refers to storing the designed and generated character deformation contour curve data in the temporary file together with character attribute description information, where the character attribute description information includes a width of a font frame, a height of the font frame, and an offset of each font in the font frame, where the offset may vary with different font structures.

Preferably, the generating of the user terminal watermark coding data includes the identity authentication information and the time information of the user terminal, and the specific generating method includes a manual designation and an automatic allocation mode.

Preferably, the automatic allocation method includes:

step1, creating a user terminal identity identification information recording table in a system background, wherein the information content comprises user ID, a user login account, a machine MAC address and machine IP address information, the user ID is automatically allocated and added by the background system, and the rest information is automatically submitted for a client monitoring program.

And step2, running a client monitoring program, automatically acquiring identity identification information of the user terminal, uploading the identity identification information to a system background, directly returning the user ID information when the uploaded identification information exists in a record table of a database of the system background, or adding a new record in the database of the system background, adding 1 to the user ID, and returning the user ID to the client.

And step3, after receiving the user ID information returned by the system background, the client monitoring process reads the system operation time in real time, and performs error correction coding processing on the user ID information and the time information to obtain the final user terminal watermark coding data.
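The lookup-or-allocate behaviour of the background record table in step 2 can be sketched as follows. This is a minimal in-memory illustration, not the patented back end: the class name `TerminalRegistry`, the method `get_or_create`, the choice of 1 as the first user ID, and the lower-casing of the MAC address are all assumptions made for the example.

```python
class TerminalRegistry:
    """Hypothetical in-memory stand-in for the background record table."""

    def __init__(self):
        self._ids = {}       # (account, mac, ip) -> user ID
        self._next_id = 1    # assumed starting value for automatically allocated IDs

    def get_or_create(self, account, mac, ip):
        """Return the existing user ID for a terminal, or allocate the next one."""
        key = (account, mac.lower(), ip)
        if key not in self._ids:
            # Unknown terminal: add a new record and increment the ID counter.
            self._ids[key] = self._next_id
            self._next_id += 1
        return self._ids[key]


registry = TerminalRegistry()
first = registry.get_or_create("alice", "AA:BB:CC:00:11:22", "10.0.0.5")
again = registry.get_or_create("alice", "aa:bb:cc:00:11:22", "10.0.0.5")
other = registry.get_or_create("bob", "AA:BB:CC:00:11:33", "10.0.0.6")
```

A production system would keep this table in a database, as the text describes, and would need to decide how to treat a terminal whose IP address changes.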

Preferably, the error correction coding processing means that a watermark information bit string of length Len, where Len is a multiple of 8, is error-correction coded by parity check to obtain the complete watermark coding data. The specific process is as follows: the watermark information bit string is arranged into a matrix of Len/8 rows and 8 columns; the parity bit of each column is calculated to form row Len/8 + 1, so that the total length of the valid information code plus the check code is Len + 8.

Preferably, the dynamically generating and loading the watermark font file in real time means opening a temporary watermark character data file, dynamically reading font outline curve data from the temporary watermark character data file according to a grouping strategy and user terminal watermark encoding data, and updating corresponding font structure data in a standard font installed in the system.

Preferably, the dynamically generating and loading the watermark word stock file in real time means that the loading of the watermark word stock is respectively completed according to the difference between Windows and Linux operating systems:

1) Windows environment.

Firstly, the system API function RemoveFontResource(PCTSTR lpFileName) is called to remove the standard font library installed in the system from the system font table; a WM_FONTCHANGE message is sent to all top-level windows in the system to notify them of the change; then the AddFontResource(LPCTSTR lpszFilename) function is called to add the dynamically generated watermark font file resource to the system font table for normal use by Windows applications.

2) A Linux environment.

The standard fonts installed by the system are uninstalled by deleting the font files from the corresponding folders. After a global font is deleted, the fc-cache -fv command is run to update the system font cache. The dynamically generated watermark font library file is then copied into the corresponding font directory, and fc-cache -fv is run again so that the system picks up the change. Once fc-cache completes, all users on the system can access the newly added global font.

Preferably, the extracting of the watermark information mainly includes three steps:

1) According to the distribution of character groups in the document content, respectively extracting watermark information bit strings represented by all characters in each group;

2) Checking and correcting the preliminarily extracted watermark information bit string by using a watermark information bit string parity check rule;

3) Finally, splicing the watermark information bit strings extracted from all the groups to obtain the final complete watermark information bit string.
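The three extraction steps can be sketched as follows, under several assumptions that the text does not state: each recognized character has already been classified into a (group index, deformation index) pair, group k carries bits 2k and 2k+1 of the coded string, the parity is even, and a group that does not occur in the captured picture is treated as an erasure. With one parity bit per column, a single erased bit in a column can be restored; the function names are hypothetical.

```python
from collections import Counter


def extract_codeword(observations, num_groups=28, bits_per_group=2):
    """Majority-vote the deformation index of each group; absent groups yield None bits."""
    votes = [Counter() for _ in range(num_groups)]
    for group_index, variant in observations:
        votes[group_index][variant] += 1
    bits = []
    for counter in votes:
        if counter:
            variant = counter.most_common(1)[0][0]
            # Most significant bit first (assumed bit order).
            bits.extend((variant >> s) & 1 for s in range(bits_per_group - 1, -1, -1))
        else:
            bits.extend([None] * bits_per_group)
    return bits


def repair_with_parity(bits, width=8):
    """Fill at most one erased bit per column using even column parity, then verify."""
    rows = [list(bits[i:i + width]) for i in range(0, len(bits), width)]
    for col in range(width):
        missing = [r for r, row in enumerate(rows) if row[col] is None]
        if len(missing) > 1:
            raise ValueError("more than one erased bit in column %d" % col)
        if missing:
            known = sum(row[col] for row in rows if row[col] is not None)
            rows[missing[0]][col] = known % 2
        if sum(row[col] for row in rows) % 2:
            raise ValueError("parity mismatch in column %d" % col)
    # Drop the parity row and return the payload bits.
    return [b for row in rows[:-1] for b in row]
```

Because all characters of a group carry the same bits, the vote in `extract_codeword` tolerates occasional misread characters, and the order in which characters appear on the page does not matter.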

Based on the same inventive concept, the invention also provides a universal text watermarking device, which comprises:

a character grouping module: the system is responsible for grouping a certain number of characters in the selected word stock according to a specific strategy;

a font design module: the system is responsible for performing deformation design on all characters in each group processed by the character grouping module according to a specific rule and generating a watermark character data temporary file;

watermark coding generation module: the system is responsible for generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal;

the font library generation and loading module: responsible for dynamically generating and loading the watermark font file in real time according to the watermark encoding data generated by the watermark encoding generation module, in combination with the watermark character data temporary file generated by the font design module and the grouped characters obtained by the character grouping module;

a watermark embedding module: responsible for running a text file in an electronic format and, using the watermark font file generated and loaded by the font library generation and loading module, embedding watermark information in real time in the document content data that is printed out or displayed on the screen;

the watermark extraction module: and the system is responsible for acquiring the document picture data which is obtained after the processing of the watermark embedding module and is hidden with the watermark information, extracting the watermark information and further tracing the document leakage source.

The invention has the following beneficial effects:

according to the invention, according to the unique watermark coding data information of the user terminal, the personalized watermark word stock is dynamically generated at the client, and the unique ID identification information is embedded in each watermark word stock. When the user identity information of different clients is different, the watermark information embedded in the watermark font library is also different. After the watermark font library is loaded in real time, all application software systems calling the local font library to carry out printing output and screen display embed watermark information in the file content in real time. Therefore, the text watermarking method has the advantages of strong universality, good compatibility, high stability and simple watermark information embedding process. In addition, due to the adoption of a packet unordered embedding strategy, the text watermarking method has higher robustness for resisting malicious attacks such as cutting, kneading, fouling, tearing and the like.

Drawings

Fig. 1 is a schematic flow chart of an implementation of a general text watermark embedding and extracting method described in the embodiment;

FIG. 2 is a schematic diagram of an arcuate grouping method;

FIG. 3 is a diagram of a temporary file structure of watermark character data;

fig. 4 is a schematic diagram of error correction encoding of watermark information in the method according to the embodiment;

FIG. 5 is a diagram illustrating a process of dynamically generating a watermark font library;

fig. 6 is a schematic structural diagram of a device for embedding and extracting a general text watermark in an embodiment.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.

Fig. 1 is a schematic flow chart of an implementation of a general text watermark embedding and extracting method in the embodiment.

S101, grouping a certain number of characters in the selected word stock according to a specific strategy.

In the invention, in order to realize universal watermark information embedding, a unique personalized watermark word stock is dynamically generated at each client, namely, the watermark word stocks generated by different clients are different (which is different from any existing technical scheme, namely, the watermark word stocks generated and installed by each client in the existing technical scheme are the same), and corresponding user terminal identity authentication information is embedded in each watermark word stock. After the watermark word stock is generated, the watermark word stock is dynamically loaded in real time and replaces a standard word stock which is pre-installed in an operating system, and when all application software systems calling watermark fonts of a client side perform screen display and printout operations of an electronic document, watermark information is automatically embedded into the document content in real time. Therefore, the watermarking method has strong universality, simple implementation process and good compatibility with a system and other application software. But in order to ensure sufficient information capacity and watermark extraction efficiency, we represent a specific watermark bit string by a group of characters, i.e. all characters in the group represent the same watermark information bit string. When at least one character of each packet is present in the text content, the corresponding bit string of watermark information can be correctly extracted. Therefore, a certain number of characters in the selected word stock are reasonably grouped according to a specific strategy, so that the probability that the characters in each group appear in the text content is relatively high.

The specific grouping method comprises the following steps:

firstly, common characters are sorted from high to low according to word frequency statistics, and the top N characters are taken to form a watermark character set.

In the present embodiment, N = 2000.

Next, the N characters are initially divided into M groups {Ω_1, Ω_2, …, Ω_M}, where M < N and, for any 0 < i, j ≤ M with i ≠ j, Ω_i ∩ Ω_j = ∅.

In the present embodiment, M =28.

The specific grouping process is as follows:

Step 1. Select M characters in character-frequency order and assign them in sequence to {Ω_1, Ω_2, …, Ω_M}, adding one character to each group.

Step 2. Select the next M characters and assign them in reverse order to {Ω_M, Ω_{M-1}, …, Ω_1}, adding one character to each group in turn.

Step 3. Repeat Step 1 and Step 2 until all N characters have been grouped.
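The forward-then-reverse distribution in Steps 1 to 3 can be sketched as below. The function name `arch_grouping` and the letter list used as a stand-in for a frequency-sorted character set are assumptions for illustration only.

```python
def arch_grouping(chars, m):
    """Distribute frequency-sorted characters into m groups, alternating direction per pass."""
    groups = [[] for _ in range(m)]
    for start in range(0, len(chars), m):
        block = chars[start:start + m]
        pass_index = start // m
        # Even passes fill groups 1..M, odd passes fill M..1.
        order = range(m) if pass_index % 2 == 0 else range(m - 1, -1, -1)
        for group_index, ch in zip(order, block):
            groups[group_index].append(ch)
    return groups


# Hypothetical stand-in for a frequency-sorted character list (N = 12, M = 4).
demo = arch_grouping(list("ABCDEFGHIJKL"), 4)
# demo == [['A', 'H', 'I'], ['B', 'G', 'J'], ['C', 'F', 'K'], ['D', 'E', 'L']]
```

Alternating the direction pairs the most frequent character of one pass with the least frequent character of the next, which evens out the total frequency per group before any optimization is applied.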

The M groups {Ω_1, Ω_2, …, Ω_M} are preliminarily obtained by the arch-shaped grouping method shown in FIG. 2. For example, the character set in the first group is:

Then, a certain number of text training corpora are randomly selected, and the number of characters in each is fixed at t, where t = 200 in this embodiment. To verify the probability with which each group of characters appears in common text documents, a large number of samples must be collected for training and testing. Therefore, nearly 50 million articles are downloaded by means of internet crawlers, covering fields such as politics, military affairs, news, sports, culture, history, and finance. After content filtering and clipping, each collected article is stored as a text training corpus of 200 characters.

Because the grouping operation is performed only based on the word frequency sorting result in the method, the situation of uneven probability distribution may occur in the actual text corpus training process, and therefore specific optimization operation is required to obtain more balanced grouping. The specific grouping optimization method comprises the following steps:

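Step 1. Calculate, for each group Ω_i (1 ≤ i ≤ M), the probability that its characters appear in the training corpora, and sort all the probabilities from largest to smallest;

Step 2. Take the character with the lowest word frequency out of the group with the highest probability and move it to the group with the lowest probability, then do the same from the group with the second-highest probability to the group with the second-lowest probability, and so on until all moves are finished;

Step 3. Repeat Step 1 and Step 2 until the variance of the group probabilities reaches its minimum, thereby obtaining the optimal grouping {Ω′_1, Ω′_2, …, Ω′_M}.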
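One possible reading of this optimization loop is sketched below. Two points are assumptions rather than statements from the text: a group's probability is taken as the fraction of training texts that contain at least one of its characters, and "until the variance reaches the minimum" is implemented by stopping as soon as a further pass no longer lowers the variance. The function names and the `freq_rank` mapping (character to frequency rank, 0 being the most frequent) are hypothetical.

```python
from statistics import pvariance


def group_probabilities(groups, corpus):
    """Fraction of corpus texts containing at least one character of each group."""
    return [sum(any(c in text for c in g) for text in corpus) / len(corpus)
            for g in groups]


def rebalance_once(groups, probs, freq_rank):
    """Move the rarest character from each high-probability group to its low-probability partner."""
    order = sorted(range(len(groups)), key=lambda i: probs[i], reverse=True)
    new_groups = [list(g) for g in groups]
    for k in range(len(order) // 2):
        src, dst = order[k], order[-1 - k]
        if len(new_groups[src]) > 1:  # never empty a group
            rarest = max(new_groups[src], key=lambda c: freq_rank[c])
            new_groups[src].remove(rarest)
            new_groups[dst].append(rarest)
    return new_groups


def optimize_groups(groups, corpus, freq_rank, max_rounds=100):
    """Repeat the rebalancing pass while it keeps reducing the probability variance."""
    best = [list(g) for g in groups]
    best_var = pvariance(group_probabilities(best, corpus))
    for _ in range(max_rounds):
        candidate = rebalance_once(best, group_probabilities(best, corpus), freq_rank)
        var = pvariance(group_probabilities(candidate, corpus))
        if var >= best_var:
            break
        best, best_var = candidate, var
    return best, best_var
```

This greedy stopping rule finds a local minimum of the variance; the text does not say whether the embodiment searches further than that.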

S102, performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file.

The character deformation design means that the vector outline curve data of a character is adjusted to obtain d different deformations, where d ≥ 2, and the different outline curve deformations represent different watermark information bit strings. To effectively increase the watermark information capacity, the number of character deformations usually exceeds 2; in the present embodiment, d = 4. That is, each character has 4 different deformations, each representing a 2-bit watermark information bit string. In addition, it should be noted that all characters in each group represent the same watermark information bit string.
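The relation between d deformations and the bits carried per group can be illustrated with a short sketch. It assumes that d is a power of two, that group k takes the k-th run of log2(d) bits from the coded string, and that the run is read as a binary number giving the deformation index; none of these conventions is specified in the text, and `variant_indices` is a hypothetical name.

```python
import math


def variant_indices(codeword_bits, d=4):
    """Split the coded bit string into one deformation index per character group."""
    bits_per_group = int(math.log2(d))
    if 2 ** bits_per_group != d:
        raise ValueError("d must be a power of two")
    if len(codeword_bits) % bits_per_group:
        raise ValueError("bit string length must be a multiple of log2(d)")
    indices = []
    for start in range(0, len(codeword_bits), bits_per_group):
        value = 0
        for bit in codeword_bits[start:start + bits_per_group]:
            value = (value << 1) | bit  # most significant bit first
        indices.append(value)
    return indices


example = variant_indices([0, 0, 0, 1, 1, 0, 1, 1])  # [0, 1, 2, 3]
```

With d = 4 and a 56-bit coded string, the function yields 28 indices, one for each of the 28 groups used in this embodiment.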

And storing the character deformation contour curve data generated by the design in a temporary file together with character attribute description information, wherein the character attribute description information comprises the width of a font frame, the height of the font frame and the offset of each font in the font frame, and the offset can be changed along with the difference of font structures. In the temporary file, a storage structure of each font data is as shown in fig. 3. The specific information is described as follows:

UNICODE encoding of characters: assigning a unique UNICODE code to the character represented by the glyph structure in the temporary file;

horizontal layout: the width of the font outer frame and the distance from the leftmost point of the font contour line to the font left frame are included;

vertical layout: the height of the font outer frame and the distance from the topmost point of the font contour line to the font upper frame are included;

size of primitive data: the capacity of the vector outline curve data of the character pattern structure is represented, and the unit is byte;

primitive data: an array of BYTE types stores vector outline curve data of a specific glyph structure, and also includes the definition of a grid and associated instruction data.
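A record with these five parts could be serialized as in the following sketch. The field widths, the little-endian byte order, and the names `GlyphRecord` and `HEADER` are assumptions for illustration; the actual layout is the one shown in fig. 3, which is not reproduced here.

```python
import struct
from dataclasses import dataclass

# Assumed fixed-width header: code point, frame width, left offset,
# frame height, top offset, outline byte count (little-endian, no padding).
HEADER = struct.Struct("<IHhHhI")


@dataclass
class GlyphRecord:
    codepoint: int      # UNICODE encoding of the character
    frame_width: int    # horizontal layout: width of the font frame
    left_offset: int    # horizontal layout: leftmost contour point to left frame
    frame_height: int   # vertical layout: height of the font frame
    top_offset: int     # vertical layout: topmost contour point to upper frame
    outline: bytes      # primitive data (vector outline plus instructions)

    def pack(self) -> bytes:
        """Serialize the header followed by the raw outline bytes."""
        return HEADER.pack(self.codepoint, self.frame_width, self.left_offset,
                           self.frame_height, self.top_offset,
                           len(self.outline)) + self.outline

    @classmethod
    def unpack_from(cls, buffer: bytes, offset: int = 0):
        """Read one record and return it with the offset of the next record."""
        cp, w, lo, h, to, size = HEADER.unpack_from(buffer, offset)
        start = offset + HEADER.size
        return cls(cp, w, lo, h, to, bytes(buffer[start:start + size])), start + size
```

Storing the outline size ahead of the outline bytes lets a reader walk the temporary file record by record without parsing the outline data itself.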

S103, generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal.

In order to effectively track and trace a client-side divulgence event, unique watermark coding data information needs to be generated for a user terminal, the content includes identity authentication information and time information of the user terminal, and a specific generation method includes a manual designation mode and an automatic distribution mode. The manual designation mode refers to that when the client system is installed and deployed, unique encoding information is manually designated for the client. The method for automatically distributing the watermark coding data characteristics comprises the following steps:

Step 1. A user terminal identity information record table is created in the system background. Its content includes the user ID, user login account, machine MAC address, and machine IP address; the user ID is allocated and added automatically by the background system, and the remaining information is submitted automatically by the client monitoring program;

Step 2. The client monitoring program runs, automatically acquires the identity information of the user terminal, and uploads it to the system background. If the uploaded identity information already exists in the record table of the background database, the user ID is returned directly; otherwise a new record is added to the background database, the user ID is incremented by 1, and the user ID is returned to the client;

Step 3. After receiving the user ID returned by the system background, the client monitoring process reads the system running time in real time and performs error correction coding on the user ID information and the time information to obtain the final user terminal watermark coding data.

For a watermark information bit string of length Len, where Len is a multiple of 8, the watermark information is error-correction coded by parity check to obtain the complete watermark coding data. The specific process is as follows: the watermark information bit string is arranged into a matrix of Len/8 rows and 8 columns; the parity bit of each column is calculated to form row Len/8 + 1, so that the total length of the valid information code plus the check code is Len + 8. In this embodiment, Len = 48, and the watermark information bit string is arranged as shown in fig. 4: the first 6 rows are the valid watermark information bit string, and each entry of the last row (the gray part in the figure) is the parity of the first 6 rows of its column, so the final watermark coding data is 56 bits long. Since 28 groups are selected in this embodiment, each representing 2 bits of watermark information, exactly 56 bits of watermark coding data can be embedded.
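The column-parity construction can be written out as a short sketch. Even parity is assumed here because the text only says "parity check"; the function name `add_column_parity` is hypothetical.

```python
def add_column_parity(bits, width=8):
    """Append one even-parity bit per column of the Len/8 x 8 bit matrix."""
    if len(bits) % width:
        raise ValueError("bit string length must be a multiple of %d" % width)
    rows = [bits[i:i + width] for i in range(0, len(bits), width)]
    # zip(*rows) iterates over columns; each parity bit makes its column sum even.
    parity_row = [sum(column) % 2 for column in zip(*rows)]
    return list(bits) + parity_row


# Len = 48 as in this embodiment: 6 payload rows plus 1 parity row gives 56 bits.
payload = [1, 0, 0, 0, 0, 0, 0, 0,
           1, 1, 0, 0, 0, 0, 0, 0] + [0] * 32
coded = add_column_parity(payload)
# coded[48:] == [0, 1, 0, 0, 0, 0, 0, 0]
```

A single parity bit per column detects an odd number of bit errors in that column and can restore one bit whose position is known to be missing, which fits the case where one character group does not occur in the captured text.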

And S104, dynamically generating and loading the watermark font file in real time according to the watermark coding data.

And the process of dynamically generating the watermark font library refers to opening a temporary file of the watermark character data, dynamically reading font outline curve data from the temporary file according to a grouping strategy and watermark coding information, and updating corresponding font structure data in a standard font library installed in the system. The specific process is shown in fig. 5.

Step1, firstly, analyzing key data of a system installation standard word stock file, wherein the key data comprises a font information header, a maximum requirement table, a name table, a font coding mapping table, a primitive position index table, a font horizontal layout degree table, a font vertical layout degree table, primitive data and the like;

step2, reading a temporary file containing watermark character contour curve data;

step3, initializing an empty font coding mapping table M, a primitive position index table S, a horizontal layout degree table H and a vertical layout degree table V, and generating an empty primitive data file tmp for storing the primitive data information of the vector;

Step 4. Write the data of each character in sequence. Let the i-th character be the one currently being processed, and let the total amount of primitive data of the i−1 characters processed before it be dwS_{i-1}; then the value of item i−1 in the primitive position index table is S_{i-1} = dwS_{i-1}. The i-th character is then processed as follows:

1) Update the font code mapping table M according to the UNICODE code that the character represents in the font library;

2) Read from the temporary file the horizontal and vertical layout information of the character, the data size dw_N of the new primitive, and the vector outline data of the new primitive, and update them into the corresponding tables of the target watermark font library;

3) Update the primitive position index table of the target watermark font file as S_i = S_{i-1} + dw_N;

4) And saving the primitive data read from the temporary file into a primitive data file tmp.

Step 5: according to the structure of the font file, write in sequence the font file header information and related attribute values, the font coding mapping table, the primitive position index table, and the horizontal and vertical layout degree tables; finally, write all primitive data stored in the primitive data file tmp into the primitive data area of the newly generated font library to produce the new watermark font library file.
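The running update of the primitive position index table in the steps above can be sketched as follows. This is only an illustration under stated assumptions: the first entry is taken to be 0 (the first primitive starts at the beginning of the primitive data area), the primitive data is modelled as plain byte strings, and the name append_primitives is hypothetical.

```python
def append_primitives(primitive_blobs):
    """Concatenate primitive data and build the position index table.

    Implements S_i = S_{i-1} + dw_N, where dw_N is the size in bytes
    of the primitive currently being written.
    """
    index = [0]          # assumption: S_0 = 0
    data = bytearray()   # plays the role of the primitive data file tmp
    for blob in primitive_blobs:
        data += blob
        index.append(index[-1] + len(blob))
    return index, bytes(data)


index, data = append_primitives([b"abc", b"", b"de"])
# The i-th primitive occupies data[index[i]:index[i + 1]].
third = data[index[2]:index[3]]
```

Storing one more entry than there are primitives lets the size of each primitive be recovered as the difference of two neighbouring entries.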

As described above, after the watermark font library is generated according to the watermark coding information, the font resource table of the operating system must be updated so that application software calls the watermark font library correctly. Depending on the operating system, loading the watermark font library falls into the following two cases:

1) A Windows environment.

First, the system API function RemoveFontResource(LPCTSTR lpFileName) is called to remove the standard font library installed by the system from the system font table, and a WM_FONTCHANGE message is sent to all top-level windows in the system to notify them of the change. Then the AddFontResource(LPCTSTR lpszFilename) function is called to add the dynamically generated watermark font library file to the system font table for normal use by Windows applications.

2) A Linux environment.

The standard fonts installed by the system are uninstalled by deleting their font files from the corresponding folders. After a global font is deleted, the fc-cache -fv command is run to update the system font cache. The dynamically generated watermark font library file is then copied into the corresponding font directory, and fc-cache -fv is run again so that the system registers the change. Once fc-cache completes, all users on the system can access the newly added global font.
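The two loading cases above can be sketched in Python as follows. This is a hedged illustration rather than the implementation of the embodiment: the function names, the font_dir parameter and the injectable run argument are hypothetical, the Windows branch uses the wide-character variants RemoveFontResourceW and AddFontResourceW through ctypes and only works on Windows, and both branches normally require sufficient privileges.

```python
import ctypes
import shutil
import subprocess
from pathlib import Path

WM_FONTCHANGE = 0x001D   # Win32 message announcing a font table change
HWND_BROADCAST = 0xFFFF  # pseudo-handle addressing all top-level windows


def swap_font_windows(standard_font, watermark_font):
    """Windows case: remove the standard font, notify, add the watermark font."""
    gdi32 = ctypes.windll.gdi32    # only available on Windows
    user32 = ctypes.windll.user32
    gdi32.RemoveFontResourceW(str(standard_font))
    user32.SendMessageW(HWND_BROADCAST, WM_FONTCHANGE, 0, 0)
    # AddFontResourceW returns the number of fonts added (0 on failure).
    return gdi32.AddFontResourceW(str(watermark_font)) > 0


def swap_font_linux(standard_font, watermark_font, font_dir, run=subprocess.run):
    """Linux case: delete the standard font, copy the new one, refresh the cache."""
    standard_font = Path(standard_font)
    if standard_font.exists():
        standard_font.unlink()
    run(["fc-cache", "-fv"], check=True)   # refresh after the deletion
    shutil.copy(watermark_font, font_dir)
    run(["fc-cache", "-fv"], check=True)   # refresh after the copy
```

Passing the command runner as a parameter is a design choice of this sketch; it allows the file operations to be exercised without fontconfig being installed.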

S105: run the text file in electronic format, and embed watermark information in real time in the document content data that the file prints out or displays on screen.

As described above, in the present invention, the watermark font file is dynamically generated and updated to the font table of the operating system to replace the standard font file installed by the operating system, so that the application program of the system automatically calls the newly loaded watermark font file, thereby completing the real-time embedding of the watermark information.

S106, obtaining the document picture data with the hidden watermark information, extracting the watermark information, and tracing the document divulgence source.

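The watermark information extraction process mainly comprises three steps:

1) First, according to the distribution of character groups in the document content, extract the watermark information bit string represented by the characters in each group;

2) Next, check and correct the preliminarily extracted watermark information bit strings using the parity check rule of the watermark information bit string;

3) Finally, splice the watermark information bit strings extracted from all groups to obtain the final complete watermark information bit string.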

It should be noted that the same character represents the same watermark information bit string wherever it appears in the document content, and that all characters in the document content that belong to the same group also represent the same watermark information. For processing efficiency, a repetition threshold p is defined for watermark character extraction in each group, that is, watermark information is extracted at most p times per group. When the same character occurs more than p times in a group, or the group contains more than p characters in the document, the watermark extraction process is run only p times. Otherwise, the watermark extraction operation is performed for all characters that appear.
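The threshold rule can be illustrated with the short Python sketch below. How the (at most p) readings within one group are combined is not stated in this embodiment; the sketch assumes a simple majority vote, and the names collect_group_symbols, splice and readings are hypothetical.

```python
from collections import Counter, defaultdict


def collect_group_symbols(readings, num_groups, p):
    """Return one symbol per group from (group_index, symbol) readings.

    At most p readings are kept per group; later ones are skipped.
    Assumption: the readings of one group are combined by majority vote.
    """
    kept = defaultdict(list)
    for group, symbol in readings:
        if len(kept[group]) < p:   # threshold p caps the work per group
            kept[group].append(symbol)
    result = []
    for g in range(num_groups):
        votes = kept.get(g)
        result.append(Counter(votes).most_common(1)[0][0] if votes else None)
    return result


def splice(symbols):
    """Concatenate the per-group symbols into the final bit string."""
    if any(s is None for s in symbols):
        raise ValueError("some group has no character in the document")
    return "".join(symbols)


readings = [(0, "01"), (1, "10"), (0, "01"), (0, "11"), (0, "11"), (0, "11")]
symbols = collect_group_symbols(readings, num_groups=2, p=3)
bit_string = splice(symbols)
```

With p = 3 only the first three readings of group 0 are used, so the later readings do not change the result for that group.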

As shown in fig. 6, based on the same inventive concept, the present invention further provides a general text watermark embedding and extracting apparatus, including:

character grouping module 1: responsible for grouping a certain number of characters in the selected font library according to a specific strategy;

font design module 2: responsible for performing deformation design, according to a specific rule, on all characters in each group produced by the character grouping module, and for generating the watermark character data temporary file;

watermark coding generation module 3: responsible for generating the watermark coding data of the user terminal, which identifies the identity authentication information of the user terminal;

font library generation and loading module 4: responsible for dynamically generating and loading the watermark font library file in real time according to the watermark coding data generated by the watermark coding generation module, in combination with the watermark character data temporary file generated by the font design module and the grouped characters obtained by the character grouping module;

watermark embedding module 5: responsible for running the text file in electronic format and, when the file is printed out or its content is displayed on screen, embedding watermark information in real time using the watermark font library file generated and loaded by the font library generation and loading module;

watermark extraction module 6: responsible for acquiring the document picture data in which watermark information was hidden by the watermark embedding module, extracting the watermark information, and thereby tracing the source of the document leak.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

For example, to prevent the dynamically generated watermark font library from being maliciously tampered with, or from being damaged when a user reinstalls the related font, a watermark font library detection service program is deployed and installed on the client. The detection service program calculates the SHA1 value of the newly generated watermark font library file and records it in a system ledger. It then scans the loaded watermark font library file at regular intervals, calculates the SHA1 value of the current file, and compares it with the value recorded in the system ledger. If the two values differ, the watermark font library file has been damaged, and the watermark font library generation and loading processes are executed again.
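A minimal sketch of such an integrity check, using the SHA1 function from the Python standard library, is given below. The system ledger is modelled as an in-memory dictionary keyed by file path, which is an assumption of this sketch, and the function names are hypothetical.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_font(path, ledger):
    """Store the digest of a newly generated font file in the ledger."""
    ledger[str(path)] = sha1_of_file(path)


def font_is_intact(path, ledger):
    """Compare the current digest with the value recorded in the ledger."""
    return ledger.get(str(path)) == sha1_of_file(path)
```

A periodic task could call font_is_intact and, when it returns False, trigger the generation and loading processes again.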

For example, the generated watermark encoding data of the user terminal includes the identity authentication information and the time information of the user terminal. In order to accurately record different time information every day, a watermark font library generation time information monitoring program can be deployed and installed on the client side. When the operating system is restarted, the monitoring program executes the dynamic generation and real-time loading work of the watermark font library and records the effective time of the current watermark font library. And during the normal operation of the operating system, the monitoring program regularly detects the effective time of the watermark font library. And if the current time is not in the same day as the effective time, the monitoring program re-executes the dynamic generation and real-time loading work, and updates the effective time of the watermark font library again.
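The daily check performed by the monitoring program can be sketched as follows. The names needs_regeneration and monitor_tick, and the regenerate callback standing in for the dynamic generation and real-time loading work, are hypothetical.

```python
from datetime import datetime


def needs_regeneration(effective_time, now):
    """True when the current time is not on the same day as the effective time."""
    return effective_time.date() != now.date()


def monitor_tick(effective_time, now, regenerate):
    """Run one periodic check and return the (possibly updated) effective time."""
    if needs_regeneration(effective_time, now):
        regenerate()   # regenerate and reload the watermark font library
        return now     # the new font library takes effect now
    return effective_time
```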

For example, in order to improve the dynamic generation efficiency of the watermark word stock of the user terminal, a watermark word stock generation time information monitoring program is deployed and installed at the client, and the corresponding watermark word stock is generated in advance according to different time periods. When the operating system is restarted, the monitoring program selects the corresponding watermark font library file according to the current time period of the system, executes the copy and real-time loading work of the watermark font library file, and then deletes the used overdue watermark font library file.

For example, in the character grouping optimization process, the grouping strategy can be further optimized by word splitting. Suppose the two characters of the high-frequency word "aim" are both present in the first group; in normal text content, the probability that the two characters appear together is then relatively high. The character of the pair with the relatively lower word frequency can therefore be moved to a lower-probability group in the current optimization stage, so that a grouping with more balanced probabilities is obtained.
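For orientation, the basic grouping and probability balancing that this refinement builds on can be sketched as below. The sketch makes several assumptions that the text does not fix: the probability of a group is taken to be the sum of the frequency counts of its characters, balancing stops as soon as a move no longer reduces the variance, and the names round_robin_groups and rebalance are hypothetical.

```python
from statistics import pvariance


def round_robin_groups(chars_by_freq, m):
    """Deal characters (sorted by descending frequency) into m groups in turn."""
    groups = [[] for _ in range(m)]
    for position, char in enumerate(chars_by_freq):
        groups[position % m].append(char)
    return groups


def rebalance(groups, freq, max_moves=1000):
    """Move low-frequency characters from the heaviest to the lightest group."""
    groups = [list(g) for g in groups]

    def weights():
        return [sum(freq[c] for c in g) for g in groups]

    for _ in range(max_moves):
        w = weights()
        high = max(range(len(groups)), key=w.__getitem__)
        low = min(range(len(groups)), key=w.__getitem__)
        if high == low or len(groups[high]) <= 1:
            break
        char = min(groups[high], key=freq.__getitem__)
        before = pvariance(w)
        groups[high].remove(char)
        groups[low].append(char)
        if pvariance(weights()) >= before:
            # The move did not help: undo it and stop.
            groups[low].remove(char)
            groups[high].append(char)
            break
    return groups


freq = {"a": 40, "b": 30, "c": 10, "d": 10, "e": 5, "f": 5}
initial = round_robin_groups(sorted(freq, key=freq.get, reverse=True), 2)
balanced = rebalance(initial, freq)
```

In this toy input the two groups start with weights 55 and 45 and end with 50 each after a single move.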

For example, in the method for generating a watermark font library according to embodiment 1, only one of the contour curve deformations of each character is written to the standard character coding area, so that system application software works normally without problems such as garbled character display. If instead the contour curve data of all deformations of each character is copied to the extended coding area of the font library and each deformation is assigned its own unique UNICODE code, then more watermark information can be embedded in text content with fewer characters by dynamically replacing character codes during document printout or outgoing management, thereby increasing the watermark information capacity.
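The code replacement idea can be illustrated with the following sketch. It assumes that the extended coding area is the Unicode Private Use Area starting at U+E000 and that each watermark character has d deformations numbered 0 to d-1; both points, as well as the names build_variant_map and embed, are assumptions of this sketch and are not stated in the text.

```python
PUA_START = 0xE000  # assumption: deformations are placed in the Private Use Area


def build_variant_map(chars, d):
    """Assign a unique code point to every (character, deformation) pair."""
    return {(char, v): chr(PUA_START + i * d + v)
            for i, char in enumerate(chars)
            for v in range(d)}


def embed(text, symbols, variant_map):
    """Replace each watermark character by the deformation chosen by a symbol."""
    remaining = iter(symbols)
    output = []
    for char in text:
        if (char, 0) in variant_map:
            v = next(remaining, None)
            # Keep the original character once all symbols have been used.
            output.append(variant_map[(char, v)] if v is not None else char)
        else:
            output.append(char)
    return "".join(output)


variant_map = build_variant_map("ab", d=4)
marked = embed("abc", [1, 2], variant_map)
```

Here every occurrence of a watermark character carries its own symbol, which is what allows more information to be carried by fewer characters.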

Claims

1. A general text watermarking method, comprising the steps of:

grouping a certain number of characters in the selected word stock according to a specific strategy;

performing deformation design on all characters in each group according to a specific rule, and generating a watermark character data temporary file;

generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal;

dynamically generating and loading a watermark font file in real time according to the watermark encoding data and by combining the watermark character data temporary file and the grouped characters;

running a text file in an electronic format, and embedding watermark information in document content data printed out of the file and displayed on a screen in real time by using a watermark font file;

the grouping of a certain number of characters in the selected word stock comprises:

firstly, sequencing common characters from high to low according to word frequency statistical results, and acquiring first N characters to form a watermark character set

Next, the N characters are preliminarily divided into M groups, denoted {Ω_1, Ω_2, …, Ω_M}, where M < N, and for any

0 < i, j ≤ M, i ≠ j,

The specific grouping process comprises the following steps:

Step 1: the selected characters are sequentially assigned to the groups {Ω_1, Ω_2, …, Ω_M}, one character being added to each group;

step2. Select M characters again

step3, repeatedly executing Step1 and Step2 until the grouping of the N characters is finished;

2. The method of claim 1, further comprising the steps of:

and acquiring document picture data hidden with watermark information, extracting the watermark information, and tracing a document leakage source.

3. The method according to claim 1 or 2, wherein the optimizing the grouping result according to the probability distribution to obtain the final grouping comprises:

Step 1: calculate the probability that the characters in each group Ω_i (1 ≤ i ≤ M) appear in t text training corpora, and sort all the probabilities from largest to smallest;

Step 2: extract the character with the lowest word frequency from the group with the highest probability and move it to the group with the lowest probability, repeating this process in sequence until all moves are complete;

Step 3: repeat Step 1 and Step 2 until the variance of the probabilities of all groups reaches its minimum, thereby obtaining the final grouping {Ω′_1, Ω′_2, …, Ω′_M}.

4. The method according to claim 1 or 2, wherein the performing deformation design on all the characters in each group means that vector contour curve data of the characters are adjusted to obtain d different deformations, wherein d is larger than or equal to 2, and the different contour curve deformations represent different watermark information bit strings; all characters in each packet represent the same watermark information bit string; the generation of the temporary file of the watermark character data refers to the fact that character deformation contour curve data generated by design and character attribute description information are stored in the temporary file together, wherein the character attribute description information comprises the width of a font frame, the height of the font frame and the offset of each font in the font frame, and the offset can be changed along with the difference of font structures.

5. The method according to claim 1 or 2, characterized in that the user terminal watermark encoding data comprises the identity authentication information and the time information of the user terminal, and the user terminal watermark encoding data is generated by adopting a manual designation or automatic distribution mode; the automatic allocation mode comprises the following steps:

step1, creating a user terminal identity identification information recording table in a system background, wherein the information content comprises user ID, a user login account, a machine MAC address and machine IP address information, the user ID is automatically allocated and added by the background system, and the rest information is automatically submitted for a client monitoring program;

step2, operating a client monitoring program, automatically acquiring identity identification information of the user terminal and uploading the identity identification information to a system background, and directly returning user ID information when the uploaded identification information exists in a record table of a system background database, or else, adding a new record in the system background database, adding 1 to the user ID and returning the user ID to the client;

6. The method according to claim 5, wherein the error correction coding processing refers to performing, for a watermark information bit string of length Len, Len being a multiple of 8, error correction coding on the watermark information by parity check to obtain complete watermark coding data, the specific process being as follows: arranging the watermark information bit string into a matrix of Len/8 rows and 8 columns; and calculating a parity check bit for each column to form row Len/8+1, the total length of the valid information code and the check code being Len+8.

7. The method according to claim 1 or 2, wherein the dynamically generating and loading the watermark font file in real time means opening a temporary watermark character data file, dynamically reading font outline curve data from the temporary watermark character data file according to a grouping strategy and user terminal watermark encoding data, and updating corresponding font structure data in a standard font installed in the system; the real-time loading of the watermark word stock file refers to the fact that the watermark word stock is loaded respectively according to different Windows and Linux operating systems.

8. The method of claim 2, wherein the extracting watermark information comprises:

according to the distribution of character groups in the document content, respectively extracting watermark information bit strings represented by all characters in each group;

checking and correcting the preliminarily extracted watermark information bit string by using a watermark information bit string parity check rule;

finally, splicing all watermark information bit strings extracted from the groups to obtain a final complete watermark information bit string.

9. A universal text watermarking apparatus that employs the method of any one of claims 1 to 8, comprising:

the character grouping module is responsible for grouping a certain number of characters in the selected word stock according to a specific strategy;

the character pattern design module is responsible for performing deformation design on all characters in each group processed by the character grouping module according to a specific rule and generating a watermark character data temporary file;

the watermark coding generation module is responsible for generating watermark coding data of the user terminal and is used for identifying the identity authentication information of the user terminal;

the font library generating and loading module is responsible for dynamically generating and loading the watermark font library file in real time according to the watermark encoding data generated by the watermark encoding generating module, and by combining the watermark character data temporary file generated by the font design module and the grouped characters obtained by the character grouping module;

the watermark embedding module is responsible for operating the text file in the electronic format, and embedding watermark information in real time by using the watermark word stock file generated and loaded by the word stock generating and loading module when the file is printed out or the content of the file is displayed on a screen;

and the watermark extraction module is in charge of acquiring the document picture data which is obtained by processing of the watermark embedding module and is hidden with the watermark information, extracting the watermark information and further tracing a document divulgence source.