CN111046841A

CN111046841A - Character extraction method, system, terminal and storage medium of PowerPoint file

Info

Publication number: CN111046841A
Application number: CN201911365765.4A
Authority: CN
Inventors: 苗功勋; 崔新安; 董盼山; 王金国; 刘万芬
Original assignee: BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD; Zhongfu Information Co Ltd; Zhongfu Safety Technology Co Ltd
Current assignee: BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD; Zhongfu Information Co Ltd; Zhongfu Safety Technology Co Ltd
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2020-04-21

Abstract

The invention provides a method, a system, a terminal and a storage medium for extracting text from a PowerPoint file. The method comprises the following steps: screening out text labels from the PowerPoint Document data stream; screening out text sub-labels by traversing the sub-label types of the text labels; extracting data from the text sub-labels and converting the data into text fields; and collecting and splicing the text fields extracted from all text labels to obtain a spliced text file. The method has good compatibility: it does not depend on the Office PowerPoint or Jinshan Dps components, does not require Office to be installed or Office component files to be extracted, and calls only system API functions. It is efficient: the file is read in binary form and the relevant data is located precisely, which markedly improves execution efficiency. The program package is small: the whole implementation is hand-written code that calls system API functions and does not rely on any third-party program file.

Description

Character extraction method, system, terminal and storage medium of PowerPoint file

Technical Field

The invention relates to the technical field of character extraction, in particular to a character extraction method, a system, a terminal and a storage medium of a PowerPoint file.

Background

Microsoft PowerPoint is the presentation program of the Office suite. It is widely used in teaching and public speaking and holds a leading market share. The PowerPoint file format has become a common de facto standard in the industry and is also adopted by other products, such as Jinshan Office Dps.

In many application scenarios, characters in PowerPoint need to be extracted for secondary processing, such as a character inspection tool, a text comparison tool, a file format conversion tool, and the like, and how to extract characters in PowerPoint completely and efficiently becomes a problem faced at present.

At present, there are two general methods for extracting PowerPoint text. One is to extract the text through the secondary development interface provided by Office PowerPoint or Jinshan Dps software (hereinafter collectively referred to as secondary interface development); the other is to extract the text with the OPI technology provided by JAVA. However, these two techniques have the following disadvantages:

Secondary interface development has the following disadvantages:

Poor compatibility. The method depends entirely on the Office PowerPoint or Jinshan Dps components: either an Office PowerPoint program must be installed in advance on the operating computer, or the Office PowerPoint component programs must be extracted manually and integrated into the extraction tool. Because of this dependence, text extraction can fail if PowerPoint is incompletely installed or incorrectly configured.

Low efficiency. Secondary interface development uses COM technology, and the data passes through multiple layers of conversion, so efficiency is low.

The JAVA OPI technique has the following disadvantages:

Large program package. Extracting the text of a document with the JAVA OPI technology requires a JAVA virtual machine in the running environment, which makes the program installation package too large.

Low efficiency. JAVA runs slowly, so text extraction is inefficient.

Disclosure of Invention

In view of the above-mentioned deficiencies of the prior art, the present invention provides a method, a system, a terminal and a storage medium for extracting characters from a PowerPoint file, so as to solve the above-mentioned technical problems.

In a first aspect, the present invention provides a text extraction method for a PowerPoint file, including:

screening out text labels from the PowerPoint Document data stream;

screening out text sub-labels by traversing the sub-label types of the text labels;

extracting data from the text sub-label and converting the data into text fields;

and summarizing and splicing a plurality of text fields extracted from all text labels to obtain a text file after splicing.

Further, the screening of text labels from PowerPoint Document data streams includes:

reading a tag header from the starting position of the tag tree of the PowerPoint Document data stream;

traversing the tag headers of all tags of the tag tree and reading the Type value in each tag header;

and screening out tags whose tag header Type value is 0x03EE or 0x03F0 as text labels.

Further, the screening out of text sub-labels by traversing the sub-label types of the text labels includes:

traversing the tag header Type values of the sub-labels of the text labels;

and screening out the sub-labels whose tag header Type value is 0x0FA0 or 0x0FA8 as text sub-labels.

Further, the extracting data from the text sub-label includes:

reading a data length value in a label head of the text sub-label;

calculating the actual length value of the data extracted from the text sub-label;

and comparing the actual length value with the data length value, and generating an error prompt if the actual length value and the data length value are not consistent.

Further, the summarizing and splicing the characters extracted from all the character labels includes:

sorting the text labels according to the positions of the text labels in the label tree, wherein the positions comprise label tree levels and layer sequences;

and splicing the corresponding text fields according to the text label sequencing.

In a second aspect, the present invention provides a text extraction system for PowerPoint files, including:

the first screening unit is configured to screen the text labels from the PowerPoint Document data stream;

the second screening unit is configured to screen out the text sub-labels by traversing the sub-label types of the text labels;

the format conversion unit is configured to extract data from the text sub-label and convert the data into text fields;

and the text splicing unit is configured to collect and splice the text fields extracted from all the text labels to obtain a spliced text file.

Further, the first screening unit includes:

a tag reading module configured to read a tag header from the starting position of the tag tree of the PowerPoint Document data stream;

a tag traversal module configured to traverse the tag headers of all tags of the tag tree and read the Type value in each tag header;

and a tag screening module configured to screen out tags whose tag header Type value is 0x03EE or 0x03F0 as text labels.

Further, the data extraction unit includes:

the length reading module is configured to read a data length value in a label head of the text sub-label;

the length calculation module is configured to calculate an actual length value of the data extracted from the text sub-label;

and the length comparison module is configured to compare the actual length value with the data length value, and if the actual length value and the data length value are not consistent, an error prompt is generated.

In a third aspect, a terminal is provided, including:

a processor, a memory, wherein,

the memory is used for storing a computer program which,

the processor is used for calling and running the computer program from the memory so as to make the terminal execute the method described above.

In a fourth aspect, a computer storage medium is provided having stored therein instructions that, when executed on a computer, cause the computer to perform the method of the above aspects.

The beneficial effects of the invention are as follows.

The method, system, terminal and storage medium provided by the invention extract all the text in a PowerPoint file by parsing the binary data of the file according to the way text is stored in it. The invention is not limited to Office PowerPoint files: any file that stores text in the same way as Office PowerPoint, such as a Jinshan Office Dps file, can be processed with this method. The method therefore has good compatibility: it does not depend on the Office PowerPoint or Jinshan Dps components, does not require Office to be installed or Office component files to be extracted, and calls only system API functions. It is efficient: the file is read in binary form and the relevant data is located precisely, which markedly improves execution efficiency. The program package is small: the whole implementation is hand-written code that calls system API functions and does not rely on any third-party program file.

In addition, the invention has reliable design principle, simple structure and very wide application prospect.

Drawings

In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.

Fig. 2 is a diagram of a label tree topology of a method of one embodiment of the invention.

Fig. 3 is a sub-label topology structure diagram of a text label (Slide label) of the method according to an embodiment of the present invention.

FIG. 4 is a schematic flow chart of a method of one embodiment of the present invention.

FIG. 5 is a schematic block diagram of a system of one embodiment of the present invention.

Fig. 6 is a schematic structural diagram of a terminal according to an embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The following explains key terms appearing in the present invention.

FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention. The executive agent in fig. 1 may be a doctor-patient interaction management system.

As shown in fig. 1, the method 100 includes:

step 110, screening out text labels from the PowerPoint Document data stream;

step 120, screening out text sub-labels by traversing the sub-label types of the text labels;

step 130, extracting data from the text sub-label, and converting the data into text fields;

step 140, collecting and splicing the text fields extracted from all the text labels to obtain a spliced text file.

Optionally, as an embodiment of the present invention, the screening of the text labels from the PowerPoint Document data stream includes:

traversing the tag headers of all tags of the tag tree and reading the Type value in each tag header;

Optionally, as an embodiment of the present invention, the screening out text sub-labels by traversing sub-label types of text labels includes:

traversing the tag header Type values of the sub-labels of the text labels;

Optionally, as an embodiment of the present invention, the extracting data from the text sub-label includes:

reading a data length value in a label head of the text sub-label;

Optionally, as an embodiment of the present invention, the collecting and splicing the texts extracted from all the text labels includes:

In order to facilitate understanding of the present invention, the following describes the text extraction method for the PowerPoint file provided by the present invention further by using the principle of the text extraction method for the PowerPoint file of the present invention and combining the text extraction process for the PowerPoint file in the embodiment.

In the binary data of a PowerPoint document, all PowerPoint data is organized through labels (tags) arranged in a tree, and the nesting depth of the labels is not limited. The label tree is shown in fig. 2.

The binary data of the PowerPoint document mainly contains the following tags: a Current User tag (Current User), a User Edit tag (User Edit), a Document tag (Document), a text tag (Slide) and a remark tag (Notes). Each tag contains many sub-tags, which record characteristic attributes of the tag such as text and style. Each tag begins with an 8-byte tag header (Header) that indicates, among other things, the Type and the Length of the tag.
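
The 8-byte tag header described above can be illustrated with a minimal Python sketch. The exact field layout used here (a 2-byte version/instance field, a 2-byte Type and a 4-byte Length, all little-endian) is an assumption based on the publicly documented PowerPoint binary format; the description itself only states that the header is 8 bytes long and carries Type and Length. The names `TagHeader` and `parse_header` are hypothetical.

```python
import struct
from typing import NamedTuple

HEADER_SIZE = 8  # stated in the description: every tag starts with an 8-byte header


class TagHeader(NamedTuple):
    ver_instance: int  # assumed: 2-byte version/instance field
    rec_type: int      # Type of the tag, e.g. 0x03EE for Slide
    length: int        # Length in bytes of the tag body that follows the header


def parse_header(buf: bytes, offset: int = 0) -> TagHeader:
    """Decode one tag header starting at `offset` (assumed little-endian layout)."""
    if len(buf) - offset < HEADER_SIZE:
        raise ValueError("truncated tag header")
    ver_instance, rec_type, length = struct.unpack_from("<HHI", buf, offset)
    return TagHeader(ver_instance, rec_type, length)
```

For example, `parse_header(struct.pack("<HHI", 0x000F, 0x03EE, 16))` yields a header whose `rec_type` is 0x03EE (Slide) and whose `length` is 16.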

Text is mainly present in text labels (Slide) and remark labels (Notes). The Slide labels store the slide text of the PowerPoint document, and the number of Slide labels equals the number of pages in the presentation. The Notes labels hold the remark sections of the PowerPoint document.

Slide and Notes labels contain two kinds of string sub-labels: TextCharsAtom, which stores a character string in Unicode, and TextBytesAtom, which stores a character string in ANSI. These two sub-labels hold the text content of the PowerPoint document.

The text storage structure of the Notes label is similar to that of the Slide label; the Slide label topology is shown in fig. 3.

Referring to fig. 4, in detail, the text extraction method for the PowerPoint file includes:

S1, screening out text labels from the PowerPoint Document data stream.

The PowerPoint document is opened in binary mode and the PowerPoint Document data stream inside it is opened. An 8-byte tag header (Header) is then read from the starting position; the header stores the Type and the size (Length) of the current tag.

According to the Type value of the Header that was read, tags with a Type value of 0x03EE (Slide) or 0x03F0 (Notes) are picked out. If the current tag is neither of these, the read position is advanced by the Length value in the Header, and the next 8-byte Header is read and checked. This loop continues until the whole stream has been traversed.

There may be multiple Slide and Notes tags in the PowerPoint Document data stream, so all tags must be traversed; the traversal cannot stop as soon as one tag is found.
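
The traversal in step S1 can be sketched in Python as follows. The sketch assumes that the PowerPoint Document data stream has already been read into a `bytes` object and that each header is laid out as a 2-byte version/instance field, a 2-byte Type and a 4-byte Length in little-endian order; neither detail is spelled out in the description. The function name `find_text_tags` is hypothetical.

```python
import struct

SLIDE, NOTES = 0x03EE, 0x03F0  # Type values given in the description
HEADER_SIZE = 8


def find_text_tags(stream: bytes) -> list[tuple[int, int, int]]:
    """Return (type, body_offset, body_length) for every Slide/Notes tag.

    Walks the top level of the stream: read a header, record the tag if it
    is a Slide or Notes tag, then skip Length bytes to reach the next header.
    """
    found = []
    pos = 0
    while pos + HEADER_SIZE <= len(stream):
        _, rec_type, length = struct.unpack_from("<HHI", stream, pos)
        body = pos + HEADER_SIZE
        if body + length > len(stream):
            raise ValueError(f"tag at offset {pos} runs past the end of the stream")
        if rec_type in (SLIDE, NOTES):
            found.append((rec_type, body, length))
        # Do not stop at the first match: there may be many Slide/Notes tags.
        pos = body + length
    return found
```

Returning offsets rather than copies of the tag bodies keeps the sketch cheap on large files; the bodies can be sliced out of `stream` when the sub-tags are traversed.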

S2, screening out the text sub-labels (the TextBytesAtom and TextCharsAtom labels) by traversing the sub-label types of the text labels.

Starting from each Slide and Notes tag, the sub-tags are traversed layer by layer to find those whose Header Type value is 0x0FA0 (TextCharsAtom) or 0x0FA8 (TextBytesAtom).
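
A layer-by-layer traversal of the sub-tags might look like the following Python sketch. How to tell a tag that contains further sub-tags from a leaf tag is not stated in the description; the sketch assumes, following the publicly documented PowerPoint binary format, that a container has the low four bits of the first header field set to 0xF. The header layout (little-endian 2-byte, 2-byte, 4-byte fields) and the name `iter_text_sub_tags` are likewise assumptions.

```python
import struct
from typing import Iterator

TEXT_CHARS_ATOM, TEXT_BYTES_ATOM = 0x0FA0, 0x0FA8  # Type values from the description
HEADER_SIZE = 8


def iter_text_sub_tags(buf: bytes, start: int, end: int) -> Iterator[tuple[int, int, int]]:
    """Yield (type, body_offset, body_length) for each text sub-tag in buf[start:end]."""
    pos = start
    while pos + HEADER_SIZE <= end:
        ver_instance, rec_type, length = struct.unpack_from("<HHI", buf, pos)
        body = pos + HEADER_SIZE
        if rec_type in (TEXT_CHARS_ATOM, TEXT_BYTES_ATOM):
            yield rec_type, body, length
        elif ver_instance & 0x000F == 0x000F:
            # Assumed container marker: descend one layer into its sub-tags.
            yield from iter_text_sub_tags(buf, body, min(body + length, end))
        pos = body + length  # move on to the next sibling tag
```

Called with the body offset and length of a Slide or Notes tag, the generator yields the text sub-tags in the order in which they occur in the file.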

S3, extracting data from the text sub-labels and converting the data into text fields.

The text data is extracted from the text sub-tags screened out in step S2. The Length in the Header of a TextCharsAtom or TextBytesAtom sub-tag gives the number of bytes occupied by the data. The length of the extracted data is compared with this value, and an error prompt is generated if the two are not consistent. The extracted data is stored as Unicode (TextCharsAtom) or ANSI (TextBytesAtom), so it must then be converted into text.
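
The length check and the conversion in step S3 can be sketched in Python as below. Two points are assumptions rather than statements of the description: "Unicode" is taken to mean UTF-16 little-endian, and "ANSI" is decoded with a configurable single-byte codec that defaults to Latin-1 (a locale-specific code page may be more appropriate in practice). The function name `extract_text` is hypothetical.

```python
TEXT_CHARS_ATOM, TEXT_BYTES_ATOM = 0x0FA0, 0x0FA8  # Type values from the description


def extract_text(buf: bytes, rec_type: int, body: int, length: int,
                 ansi_codec: str = "latin-1") -> str:
    """Slice out a text sub-tag body, verify its length, and decode it."""
    data = buf[body:body + length]
    # Compare the actual length with the Length recorded in the tag header.
    if len(data) != length:
        raise ValueError(f"header declares {length} bytes but only {len(data)} are available")
    if rec_type == TEXT_CHARS_ATOM:
        return data.decode("utf-16-le")  # assumed encoding for "Unicode"
    if rec_type == TEXT_BYTES_ATOM:
        return data.decode(ansi_codec)   # assumed codec for "ANSI"
    raise ValueError(f"not a text sub-tag: 0x{rec_type:04X}")
```

Raising `ValueError` stands in for the error prompt mentioned above; a real tool might instead log the problem and continue with the next sub-tag.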

S4, collecting and splicing the text fields extracted from all the text labels to obtain the spliced text file.

Multiple text labels may be screened out in step S1, and one text label (Slide tag or Notes tag) may contain multiple TextCharsAtom and TextBytesAtom tags. The text in each TextCharsAtom and TextBytesAtom tag is taken out in turn for splicing: the text labels are sorted according to their positions in the tag tree, where a position comprises the tag tree level and the order within that level, and the corresponding text fields are spliced in that order. Once the text of all TextCharsAtom and TextBytesAtom tags has been taken out and spliced, the result is the complete text of the PowerPoint document. The spliced text file is then output.
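
One way to read the sorting rule in step S4 is sketched below in Python. The sketch represents the position of a text field as a tuple of sibling indexes from the root of the tag tree, so that sorting the tuples reproduces document order; this representation, the newline separator and the name `splice_fields` are assumptions made for illustration and are not taken from the description.

```python
def splice_fields(fields: list[tuple[tuple[int, ...], str]], sep: str = "\n") -> str:
    """Sort text fields by their position in the tag tree and join them.

    Each field is (position, text), where position is an assumed path of
    sibling indexes from the root, e.g. (2, 0, 1).
    """
    ordered = sorted(fields, key=lambda item: item[0])
    return sep.join(text for _, text in ordered)
```

If the sub-tags are collected in a single front-to-back pass over the stream, their byte offsets are already in document order and could serve as the sort key instead.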

As shown in fig. 5, the system 500 includes:

a first screening unit 510 configured to screen out text labels from a PowerPoint Document data stream;

a second screening unit 520 configured to screen out text sub-labels by traversing sub-label types of the text labels;

a format conversion unit 530 configured to extract data from the text sub-label and convert the data into a text field;

and a text splicing unit 540 configured to collect and splice the text fields extracted from all the text labels to obtain a spliced text file.

Optionally, as an embodiment of the present invention, the first screening unit includes:

Optionally, as an embodiment of the present invention, the data extracting unit includes:

Fig. 6 is a schematic structural diagram of a terminal system 600 according to an embodiment of the present invention, where the terminal system 600 may be used to execute the text extraction method for a PowerPoint file according to the embodiment of the present invention.

The terminal system 600 may include: a processor 610, a memory 620, and a communication unit 630. The components communicate via one or more buses, and those skilled in the art will appreciate that the architecture of the servers shown in the figures is not intended to be limiting, and may be a bus architecture, a star architecture, a combination of more or less components than those shown, or a different arrangement of components.

The memory 620 may be used for storing instructions executed by the processor 610, and the memory 620 may be implemented by any type of volatile or non-volatile storage terminal or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The executable instructions in memory 620, when executed by processor 610, enable terminal 600 to perform some or all of the steps in the method embodiments described below.

The processor 610 is a control center of the storage terminal, connects various parts of the entire electronic terminal using various interfaces and lines, and performs various functions of the electronic terminal and/or processes data by operating or executing software programs and/or modules stored in the memory 620 and calling data stored in the memory. The processor may be composed of an Integrated Circuit (IC), for example, a single packaged IC, or a plurality of packaged ICs connected with the same or different functions. For example, the processor 610 may include only a Central Processing Unit (CPU). In the embodiment of the present invention, the CPU may be a single operation core, or may include multiple operation cores.

The communication unit 630 is configured to establish a communication channel so that the storage terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to other terminals.

The present invention also provides a computer storage medium, wherein the computer storage medium may store a program, and the program may include some or all of the steps in the embodiments provided by the present invention when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).

The invention therefore extracts all the text in a PowerPoint document by parsing the binary data of the document according to the way text is stored in it. The invention is not limited to Office PowerPoint files: any file that stores text in the same way as Office PowerPoint, such as a Jinshan Office Dps file, can be processed with this method. The method therefore has good compatibility: it does not depend on the Office PowerPoint or Jinshan Dps components, does not require Office to be installed or Office component files to be extracted, and calls only system API functions. It is efficient: the file is read in binary form and the relevant data is located precisely, which markedly improves execution efficiency. The program package is small: the whole implementation is hand-written code that calls system API functions and does not rely on any third-party program file. For the technical effects achieved by this embodiment, reference may be made to the description above; the details are not repeated here.

Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be embodied in the form of a software product, where the computer software product is stored in a storage medium, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like, and the storage medium can store program codes, and includes instructions for enabling a computer terminal (which may be a personal computer, a server, or a second terminal, a network terminal, and the like) to perform all or part of the steps of the method in the embodiments of the present invention.

The same and similar parts in the various embodiments in this specification may be referred to each other. Especially, for the terminal embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant points can be referred to the description in the method embodiment.

In the embodiments provided in the present invention, it should be understood that the disclosed system and method can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, systems or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

Although the present invention has been described in detail with reference to the drawings and in connection with the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and any such modifications or substitutions that a person skilled in the art can readily conceive within the technical scope disclosed herein fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for extracting characters of a PowerPoint file is characterized by comprising the following steps:

screening out text labels from the PowerPoint Document data stream;

2. The method of claim 1, wherein the screening text labels from PowerPoint Document data streams comprises:

traversing the tag head header and tag head types of all tags of the tag tree;

3. The method of claim 1, wherein filtering out text sub-labels by traversing sub-label types of text labels comprises:

traversing the sub-label head Type value of the character label;

4. The method of claim 1, wherein extracting data from text sub-labels comprises:

reading a data length value in a label head of the text sub-label;

5. The method of claim 2, wherein the summarizing and splicing the texts extracted from all the text labels comprises:

6. A system for extracting characters from PowerPoint files is characterized by comprising:

7. The system of claim 6, wherein the first screening unit comprises:

8. The system of claim 6, wherein the data extraction unit comprises:

9. A terminal, comprising:

a processor;

a memory for storing instructions for execution by the processor;

wherein the processor is configured to perform the method of any one of claims 1-5.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.