CN112966679A - Information tracing method and system based on minimum character connected domain deviation - Google Patents

Information tracing method and system based on minimum character connected domain deviation

Info

Publication number
CN112966679A
Authority
CN
China
Prior art keywords
character
text
connected domain
available
offset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110280692.XA
Other languages
Chinese (zh)
Inventor
方俊
祝玉鹏
吕文晋
陶冶
孙鑫凯
史祎诗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuxin Kunpeng Beijing Information Technology Co ltd
University of Chinese Academy of Sciences
Original Assignee
Fuxin Kunpeng Beijing Information Technology Co ltd
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-03-16
Filing date
2021-03-16
Publication date
2021-06-15
Application filed by Fuxin Kunpeng Beijing Information Technology Co ltd and University of Chinese Academy of Sciences
Priority to CN202110280692.XA
Publication of CN112966679A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/0021 Image watermarking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/28 Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2201/00 General purpose image data processing
    • G06T2201/005 Image watermarking
    • G06T2201/0065 Extraction of an embedded watermark; Reliable detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Editing Of Facsimile Originals (AREA)

Abstract

The invention relates to an information tracing method and system based on the offset of a character's minimum connected domain. The method comprises the following steps: acquiring a first text character; determining a first available character in the first text character; embedding the embedded information into the minimum connected domain of the first available character according to an offset, the offset being the distance between the embedded information and the first available character; acquiring a second text character; performing frame selection on a second available character in the second text character to obtain the minimum connected domain of the second available character; extracting the maximum connected domain from the minimum connected domain of the second available character and performing row-column projection to obtain the offset; and extracting the embedded information according to a preset threshold and the offset. After a text is printed and scanned, pixels at the character edges may be lost and the line spacing and character spacing of the document may change, but the connected domains inside the characters do not change, which makes the embedded information more resistant to printing and scanning.

Description

Information tracing method and system based on minimum character connected domain deviation
Technical Field
The invention relates to the technical field of information tracing, and in particular to an information tracing method and system based on the offset of the minimum connected domain of characters.
Background
With the rapid growth of the mobile internet, more and more documents are transmitted as digital media. The use of electronic documents over networks makes daily life more convenient and speeds up office work, but it also makes things easier for attackers, and staff of security-sensitive departments face the risk of leaking secrets. Embedding the operator's name and date information in an electronic file can effectively address security problems such as copyright protection and source tracing after a confidential file is leaked.
After tracing information is embedded in an electronic document, the biggest problem is that the information is lost, or is extracted with errors, once the document has been printed several times. Existing print-and-scan-resistant document tracing techniques fall roughly into three types: algorithms based on the text image, algorithms based on the text format, and algorithms based on the text content. The first type embeds information mainly by changing the edge pixels of the segmented characters, and its robustness is poor for documents printed and scanned several times. The second type hides information by changing the line spacing, column spacing, or file format of the text, so extraction errors occur if the image is scaled or the file format is changed. The third type hides information by modifying the text content, for example by replacing words with synonyms, but the content of many files may not be modified.
Therefore, a main research direction at present is how to avoid the extraction failures and extraction errors of tracing information that are caused by lost character edge pixels and by changed line spacing and character spacing after repeated printing and scanning.
Disclosure of Invention
The invention aims to provide an information tracing method and system based on the offset of the minimum connected domain of characters, which solve the extraction failures and extraction errors of tracing information caused by lost character edge pixels and by changed line spacing and character spacing after repeated printing and scanning, and which improve the resistance of the tracing information to printing and scanning.
In order to achieve the purpose, the invention provides the following scheme:
an information tracing method based on character minimum connected domain offset comprises the following steps:
performing character segmentation on the text image to obtain a first text character;
determining a first available character in the first text character according to the connected domain of the first text character;
embedding the embedded information into the minimum connected domain of the first available character according to the offset to obtain a text with the embedded information; the offset is a distance between the embedded information and the first available character;
performing character segmentation on the printed or scanned part of the text embedded with the information to obtain a second text character;
performing frame selection on a second available character in the second text character according to the minimum connected domain of the first available character to obtain the minimum connected domain of the second available character;
extracting the maximum connected domain from the minimum connected domain of the second available character and performing row-column projection to obtain the offset;
and extracting the embedded information according to a preset threshold and the offset.
Optionally, the determining a first available character in the first text character according to the connected domain of the first text character specifically includes:
removing, from the first text character, characters whose minimum connected domain lies at the edge and characters whose number of connected domains equals 1, and determining the first available character in the first text character.
Optionally, the offset is set according to a resolution size of the text image.
Optionally, the method further comprises:
and determining a second available character in the second text character according to the connected domain of the second text character.
Optionally, the extracting the embedded information according to the preset threshold and the offset specifically includes:
comparing the offset with a preset threshold, the comparison result being represented by 0 or 1;
and extracting the embedded information according to the comparison result.
An information tracing system based on character minimum connected domain offset, comprising:
the first segmentation module is used for performing character segmentation on the text image to obtain a first text character;
the determining module is used for determining a first available character in the first text character according to the connected domain of the first text character;
the embedding module is used for embedding the embedded information into the minimum connected domain of the first available character according to the offset to obtain a text after the information is embedded; the offset is a distance between the embedded information and the first available character;
the second segmentation module is used for performing character segmentation on the printed or scanned piece of the text embedded with the information to obtain a second text character;
the frame selection module is used for carrying out frame selection on a second available character in the second text character according to the minimum connected domain of the first available character to obtain the minimum connected domain of the second available character;
the first extraction module is used for extracting the maximum connected domain from the minimum connected domain of the second available character and performing row-column projection to obtain the offset;
and the second extraction module is used for extracting the embedded information according to a preset threshold and the offset.
Optionally, the determining module specifically includes:
the removing unit is used for removing, from the first text character, characters whose minimum connected domain lies at the edge and characters whose number of connected domains equals 1, and determining the first available character in the first text character.
Optionally, the offset is set according to a resolution size of the text image.
Optionally, the system further comprises:
the second available character determining module is used for determining a second available character in the second text character according to the connected domain of the second text character.
Optionally, the second extraction module specifically includes:
the judging unit is used for comparing the offset with a preset threshold, the comparison result being represented by 0 or 1;
and the extraction unit is used for extracting the embedded information according to the comparison result.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention discloses an information tracing method and system based on the offset of a character's minimum connected domain, wherein the method comprises the following steps: performing character segmentation on the text image to obtain a first text character; determining a first available character in the first text character according to the connected domain of the first text character; embedding the embedded information into the minimum connected domain of the first available character according to the offset to obtain a text with the embedded information, the offset being a distance between the embedded information and the first available character; performing character segmentation on the printed or scanned copy of the text with the embedded information to obtain a second text character; performing frame selection on a second available character in the second text character according to the minimum connected domain of the first available character to obtain the minimum connected domain of the second available character; extracting the maximum connected domain from the minimum connected domain of the second available character and performing row-column projection to obtain the offset; and extracting the embedded information according to a preset threshold and the offset. After a text is printed and scanned, pixels at the character edges may be lost and the line spacing and character spacing of the document may change, but the connected domains inside the characters do not change, which makes the embedded information more resistant to printing and scanning.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an information tracing method based on character minimum connected domain offset according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of information embedding provided by an embodiment of the present invention;
FIG. 3(a) is an image before embedding information according to an embodiment of the present invention; FIG. 3(b) is an image with embedded information according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of extracting embedded information according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide an information tracing method and system based on the offset of the minimum connected domain of characters, which solve the extraction failures and extraction errors of tracing information caused by lost character edge pixels and by changed line spacing and character spacing after repeated printing and scanning, and which improve the resistance of the tracing information to printing and scanning.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in FIG. 1, the information tracing method based on character minimum connected domain offset includes:
Step 101: performing character segmentation on the text image to obtain a first text character.
Specifically, the text image is first binarized, and character segmentation is then performed to obtain the first text characters I_1, I_2, ..., I_N, which are saved to a specific folder. Here I_1 is the first character of the first text characters, I_2 is the second character, and so on, and I_N is the N-th character.
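As a minimal sketch of Step 101, the following Python code binarizes a grayscale image and splits a single text line into characters by column projection. The patent does not specify the binarization threshold or the segmentation algorithm, so the fixed threshold of 128, the projection-based split, and the function names `binarize` and `segment_columns` are all assumptions made for illustration. A pure column split would also cut apart characters whose left and right parts are separated by a blank column, so a practical implementation would need a merging rule that is not shown here.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background.

    The threshold value is an assumption; the source only says the image
    is binarized.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment_columns(binary):
    """Split one text line into characters at empty columns.

    Returns a list of (start_col, end_col) pairs, end exclusive.
    """
    if not binary:
        return []
    width = len(binary[0])
    # Column projection: number of ink pixels in each column.
    col_sum = [sum(row[c] for row in binary) for c in range(width)]
    spans, start = [], None
    for c, s in enumerate(col_sum):
        if s > 0 and start is None:
            start = c                    # a character begins
        elif s == 0 and start is not None:
            spans.append((start, c))     # a character ends at a blank column
            start = None
    if start is not None:
        spans.append((start, width))     # character touching the right border
    return spans
```

Each returned span can then be cropped out of the line image and stored as one of the characters I_1, I_2, ..., I_N.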
Step 102: determining a first available character in the first text character according to the connected domain of the first text character.
Specifically, characters whose minimum connected domain lies at the edge, and characters whose number of connected domains equals 1, are removed from the first text characters, and the remaining characters are the first available characters. The first available characters are denoted Ⅱ_1, Ⅱ_2, ..., Ⅱ_N, where Ⅱ_1 is the first of the first available characters, Ⅱ_2 is the second, and so on, and Ⅱ_N is the N-th.
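The selection rule of Step 102 can be sketched as follows. This is one possible reading of the text, not the patented implementation: it assumes that a connected domain is an 8-connected group of ink pixels, that the "minimum" connected domain is the one with the fewest pixels, and that "at the edge" means its bounding box touches the border of the character image. The function names are hypothetical.

```python
from collections import deque


def connected_domains(binary):
    """Return the 8-connected ink components as lists of (row, col) pixels."""
    h = len(binary)
    w = len(binary[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    domains = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            domains.append(pixels)
    return domains


def bounding_box(pixels):
    """Return (col, row, width, height) of a pixel list."""
    rows = [p[0] for p in pixels]
    cols = [p[1] for p in pixels]
    return (min(cols), min(rows),
            max(cols) - min(cols) + 1, max(rows) - min(rows) + 1)


def is_available(binary):
    """Assumed rule: more than one domain, and the smallest one is off the border."""
    domains = connected_domains(binary)
    if len(domains) <= 1:
        return False
    col, row, width, height = bounding_box(min(domains, key=len))
    h, w = len(binary), len(binary[0])
    return col > 0 and row > 0 and col + width < w and row + height < h
```

Under these assumptions, a character made of a single stroke group is rejected because it has nothing that can be moved independently, and a character whose smallest domain touches the border is rejected because a shift could push it outside the character image.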
Step 103: embedding the embedded information into the minimum connected domain of the first available character according to the offset to obtain a text with the embedded information; the offset is a distance between the embedded information and the first available character.
Furthermore, to make the embedded information better hidden, the size of the text image can be read in advance to obtain its resolution, and the offset is then set according to that resolution.
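The embedding of Step 103 might look like the sketch below. The patent does not state how a bit is mapped to a displacement or how the offset depends on the resolution, so several things here are assumptions: bit 1 is encoded by shifting the minimum connected domain to the right by `offset` pixels and bit 0 by leaving it in place, and the offset scales linearly with resolution, anchored to the one data point reported in the text (6 pixels at 600 dpi). The names `offset_for_resolution` and `embed_bit` are hypothetical.

```python
def offset_for_resolution(dpi):
    """Hypothetical linear rule anchored to the reported 6 px at 600 dpi."""
    return max(1, round(dpi * 6 / 600))


def embed_bit(binary, domain_pixels, bit, offset):
    """Return a copy of `binary` with the given domain shifted right for bit 1.

    `domain_pixels` is the list of (row, col) pixels of the minimum
    connected domain. Bit 0 leaves the character unchanged (assumption).
    """
    out = [row[:] for row in binary]
    if bit == 0:
        return out
    width = len(binary[0])
    own = set(domain_pixels)
    targets = [(r, c + offset) for r, c in domain_pixels]
    for r, c in targets:
        # Refuse shifts that leave the image or merge with another stroke.
        if c >= width or (binary[r][c] == 1 and (r, c) not in own):
            raise ValueError("shift would leave the image or hit another stroke")
    for r, c in domain_pixels:
        out[r][c] = 0
    for r, c in targets:
        out[r][c] = 1
    return out
```

Rejecting a shift that would touch another stroke keeps the number of connected domains of the character unchanged, which matters because the extraction side relies on finding the same domains again.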
The above steps realize information embedding; a schematic diagram is shown in FIG. 2.
FIG. 3(a) is an image before information is embedded according to an embodiment of the present invention, and FIG. 3(b) is an image after information is embedded according to an embodiment of the present invention. As shown in FIG. 3(a) and FIG. 3(b), the image does change when the information is embedded, but the change is imperceptible to the human eye, which shows that the text with embedded information keeps its visual appearance and that the watermark (the embedded information) is imperceptible.
Step 104: performing character segmentation on the printed or scanned copy of the text with the embedded information to obtain a second text character.
Specifically, the printed or scanned copy of the text with the embedded information is binarized, and character segmentation is then performed to obtain the second text characters ch_1, ch_2, ..., ch_N. Here ch_1 is the first of the second text characters, ch_2 is the second, and so on, and ch_N is the N-th character of the second text characters.
Step 105: performing frame selection on a second available character in the second text character according to the minimum connected domain of the first available character to obtain the minimum connected domain of the second available character.
In this step, the second available characters in the second text characters are also determined according to the connected domains of the second text characters. That is, characters whose minimum connected domain lies at the edge, and characters whose number of connected domains equals 1, are removed from the second text characters, and the remaining characters are the second available characters. The second available characters are denoted cm_1, cm_2, ..., cm_N, where cm_1 is the first of the second available characters, cm_2 is the second, and so on, and cm_N is the N-th.
The coordinate parameters (col, row, width, height) of the minimum connected domain of the second available character are then obtained, where col is the abscissa of the upper-left point of the minimum connected domain, row is the ordinate of the upper-left point, width is the width of the minimum connected domain, and height is its height. Frame selection is performed on the second available character using these coordinate parameters to obtain the minimum connected domain of the second available character.
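Frame selection with the coordinate parameters (col, row, width, height) amounts to cropping a rectangle out of the character image. A minimal sketch, assuming the character is stored as a list of pixel rows and using the hypothetical name `frame_select`:

```python
def frame_select(binary, col, row, width, height):
    """Crop the rectangle whose upper-left corner is (col, row).

    col and row follow the convention in the text: col is the abscissa
    (x) and row is the ordinate (y) of the upper-left point.
    """
    return [line[col:col + width] for line in binary[row:row + height]]
```

For example, `frame_select(image, 1, 1, 2, 2)` returns the 2 x 2 block that starts one pixel in from the top-left corner of `image`.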
Step 106: extracting the maximum connected domain from the minimum connected domain of the second available character and performing row-column projection to obtain the offset.
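The row-column projection of Step 106 can be illustrated as follows. The patent does not define exactly how the offset is read off the projections, so this sketch assumes that the frame-selected region already contains only its largest connected domain (smaller specks having been discarded as noise) and that the offset is the horizontal gap between the left edge of the frame and the first inked column. The function names are hypothetical.

```python
def row_column_projection(region):
    """Return (row_sums, col_sums) of ink pixels in a binary region."""
    row_sums = [sum(line) for line in region]
    col_sums = [sum(col) for col in zip(*region)]
    return row_sums, col_sums


def first_inked(sums):
    """Index of the first non-empty row or column, or None if blank."""
    for i, s in enumerate(sums):
        if s > 0:
            return i
    return None


def measure_offset(region):
    """Assumed measure: gap between the frame's left edge and the ink."""
    _, col_sums = row_column_projection(region)
    return first_inked(col_sums)
```

A vertical displacement could be measured in the same way from `row_sums`; which axis carries the information depends on how the embedding side shifts the domain.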
Step 107: extracting the embedded information according to a preset threshold and the offset. This specifically comprises the following steps:
comparing the offset with a preset threshold, the comparison result being represented by 0 or 1;
and extracting the embedded information according to the comparison result.
That is, 1 is returned if the offset is greater than the threshold and 0 is returned if the offset is less than the threshold, which yields a sequence of 0s and 1s; this sequence is then converted into the hidden character information through Unicode coding. FIG. 4 is a schematic diagram of extracting the embedded information according to an embodiment of the present invention.
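The thresholding and decoding of Step 107 can be sketched as below. The text says only that the bit sequence is converted "through Unicode coding", so the 16-bit, most-significant-bit-first layout used here is an assumption, as is treating an offset equal to the threshold as 0 (the source leaves that case unspecified). The function names are hypothetical.

```python
def offsets_to_bits(offsets, threshold):
    """Return 1 for each offset above the threshold, otherwise 0.

    An offset exactly equal to the threshold maps to 0 (assumption).
    """
    return [1 if off > threshold else 0 for off in offsets]


def bits_to_text(bits, bits_per_char=16):
    """Decode groups of bits (most significant first) as Unicode code points.

    Trailing bits that do not fill a whole group are ignored.
    """
    chars = []
    for i in range(0, len(bits) - bits_per_char + 1, bits_per_char):
        code = 0
        for b in bits[i:i + bits_per_char]:
            code = (code << 1) | b
        chars.append(chr(code))
    return "".join(chars)
```

With this layout, one hidden character occupies 16 available characters of the carrier text, so the measured offsets of 16 consecutive available characters are needed to recover one character of the operator's name or date.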
The present embodiment further provides an information tracing system based on minimum connected domain offset of characters, including:
and the first segmentation module is used for performing character segmentation on the text image to obtain a first text character.
And the determining module is used for determining a first available character in the first text character according to the connected domain of the first text character. Wherein, the determining module specifically comprises:
and the removing unit is used for removing, from the first text character, characters whose minimum connected domain lies at the edge and characters whose number of connected domains equals 1, and determining the first available character in the first text character.
The embedding module is used for embedding the embedded information into the minimum connected domain of the first available character according to the offset to obtain a text after the information is embedded; the offset is a distance between the embedded information and the first available character. And setting the offset according to the resolution of the text image.
And the second segmentation module is used for performing character segmentation on the printed or scanned piece of the text embedded with the information to obtain a second text character.
And the frame selection module is used for carrying out frame selection on a second available character in the second text character according to the minimum connected domain of the first available character to obtain the minimum connected domain of the second available character.
And the first extraction module is used for extracting the maximum connected domain from the minimum connected domain of the second available character and performing row-column projection to obtain the offset.
And the second extraction module is used for extracting the embedded information according to a preset threshold and the offset. Wherein, the second extraction module specifically comprises:
the judging unit is used for comparing the offset with a preset threshold, the comparison result being represented by 0 or 1;
and the extraction unit is used for extracting the embedded information according to the comparison result.
In this embodiment, the information tracing system based on character minimum connected domain offset further includes:
and the second available character determining module is used for determining a second available character in the second text character according to the connected domain of the second text character.
The main performance indicators of a document text tracing method are invisibility, robustness, and hidden capacity. In this method, the offset applied to the minimum connected domain of a character depends on the image resolution. In a test on a 600 dpi text image, the text image with embedded information and an offset of 6 pixels showed good invisibility.
With the invention, the tracing success rate for a text image with embedded information is 100% after one print-scan cycle, 90% after two print-scan cycles, and 80% after three print-scan cycles, which indicates good robustness.
Because the characters differ from one text to another, the hidden capacity varies slightly from text to text. In tests on a large number of document text images, the invention embedded 40 to 50 bits of information per 100 Chinese characters, which is a high hidden capacity.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. An information tracing method based on character minimum connected domain offset is characterized by comprising the following steps:
performing character segmentation on the text image to obtain a first text character;
determining a first available character in the first text character according to the connected domain of the first text character;
embedding the embedded information into the minimum connected domain of the first available character according to the offset to obtain a text with the embedded information; the offset is a distance between the embedded information and the first available character;
performing character segmentation on the printed or scanned part of the text embedded with the information to obtain a second text character;
performing frame selection on a second available character in the second text character according to the minimum connected domain of the first available character to obtain the minimum connected domain of the second available character;
extracting the maximum connected domain from the minimum connected domain of the second available character and performing row-column projection to obtain the offset;
and extracting the embedded information according to a preset threshold and the offset.
2. The information tracing method based on character minimum connected domain offset according to claim 1, wherein the determining a first available character in the first text character according to the connected domain of the first text character specifically includes:
removing, from the first text character, characters whose minimum connected domain lies at the edge and characters whose number of connected domains equals 1, and determining the first available character in the first text character.
3. The information tracing method based on character minimum connected domain offset according to claim 1, wherein the offset is set according to the resolution of the text image.
4. The information tracing method based on character minimum connected domain offset according to claim 1, further comprising:
and determining a second available character in the second text character according to the connected domain of the second text character.
5. The information tracing method based on character minimum connected domain offset according to claim 1, wherein the extracting of the embedded information according to the preset threshold and the offset specifically includes:
comparing the offset with a preset threshold, the comparison result being represented by 0 or 1;
and extracting the embedded information according to the comparison result.
6. An information tracing system based on character minimum connected domain offset, comprising:
the first segmentation module is used for performing character segmentation on the text image to obtain a first text character;
the determining module is used for determining a first available character in the first text character according to the connected domain of the first text character;
the embedding module is used for embedding the embedded information into the minimum connected domain of the first available character according to the offset to obtain a text after the information is embedded; the offset is a distance between the embedded information and the first available character;
the second segmentation module is used for performing character segmentation on the printed or scanned piece of the text embedded with the information to obtain a second text character;
the frame selection module is used for carrying out frame selection on a second available character in the second text character according to the minimum connected domain of the first available character to obtain the minimum connected domain of the second available character;
the first extraction module is used for extracting the maximum connected domain from the minimum connected domain of the second available character and performing row-column projection to obtain the offset;
and the second extraction module is used for extracting the embedded information according to a preset threshold and the offset.
7. The system of claim 6, wherein the determining module specifically comprises:
and the removing unit is used for removing, from the first text character, characters whose minimum connected domain lies at the edge and characters whose number of connected domains equals 1, and determining the first available character in the first text character.
8. The system of claim 6, wherein the offset is set according to a resolution of the text image.
9. The system of claim 6, further comprising:
and the second available character determining module is used for determining a second available character in the second text character according to the connected domain of the second text character.
10. The system of claim 6, wherein the second extraction module specifically comprises:
the judging unit is used for comparing the offset with a preset threshold, the comparison result being represented by 0 or 1;
and the extraction unit is used for extracting the embedded information according to the comparison result.
CN202110280692.XA 2021-03-16 2021-03-16 Information tracing method and system based on minimum character connected domain deviation Pending CN112966679A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110280692.XA CN112966679A (en) 2021-03-16 2021-03-16 Information tracing method and system based on minimum character connected domain deviation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110280692.XA CN112966679A (en) 2021-03-16 2021-03-16 Information tracing method and system based on minimum character connected domain deviation

Publications (1)

Publication Number Publication Date
CN112966679A true CN112966679A (en) 2021-06-15

Family

ID=76277759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110280692.XA Pending CN112966679A (en) 2021-03-16 2021-03-16 Information tracing method and system based on minimum character connected domain deviation

Country Status (1)

Country Link
CN (1) CN112966679A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination