CN112733056A - Document processing method, device, equipment and storage medium - Google Patents


Publication number
CN112733056A
Authority
CN
China
Prior art keywords
document
picture
target
formula
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110359276.9A
Other languages
Chinese (zh)
Other versions
CN112733056B (en)
Inventor
吉玉婷
赵永康
马义
李钢江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baijiayun Group Ltd
Shenzhen Baishilian Technology Co Ltd
Original Assignee
Baijiayun Group Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baijiayun Group Ltd
Priority to CN202110359276.9A
Publication of CN112733056A
Application granted
Publication of CN112733056B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/164 File meta data generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Library & Information Science (AREA)
  • Information Transfer Between Computers (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application provides a document processing method, device, equipment and storage medium. The method comprises: obtaining, from a document to be processed, the pictures it contains and a document to be analyzed, and storing each obtained picture in a remote server; for each obtained picture, establishing a replacement path for the picture in the document to be analyzed according to the position of the picture in the document to be processed and the storage address of the picture in the remote server; analyzing the document to be analyzed and determining the xml tag corresponding to each media element in it; and converting the document to be analyzed into a display document in a target format based on each determined xml tag and the replacement path corresponding to each picture. The document to be processed is thereby converted into a format that can be displayed in a webpage, which improves document processing efficiency while meeting the requirements of online question bank entry.

Description

Document processing method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of data processing, in particular to a document processing method, a document processing device, document processing equipment and a storage medium.
Background
With the development of information technology, the education industry has gradually moved from offline to online, and more and more students prefer to learn through online education. Online education applications frequently need question bank data for various subjects to be entered, and the entered data is then displayed to users as webpages so that they can answer questions online.
In the current approach to entering question bank data, out of habit or for ease of archiving, a topic document is usually created with common office software such as Word or Excel. A COM (component) component is then used to read the Word version of the topic document, the content read is entered into the online question bank, and the topic document is converted into html (hypertext markup language) format for display in a webpage. However, a topic document generally contains several kinds of media elements, such as pictures, text, tables and formulas, and the COM component cannot convert the pictures and formulas in the topic document into an html format that can be displayed in a webpage. Processing topic documents with the COM component therefore cannot meet the actual requirements of online question bank entry, and document processing efficiency is low.
Disclosure of Invention
In view of the above, the present invention provides a document processing method, apparatus, device and storage medium, so as to improve the processing efficiency of a document on the basis of satisfying the requirement of online question bank entry.
In a first aspect, an embodiment of the present application provides a document processing method, where the method includes:
obtaining, from a document to be processed, the pictures it contains and a document to be analyzed, and storing each obtained picture in a remote server, wherein the document to be processed is a Word document containing multiple media elements of different types, and the document to be analyzed is the document to be processed in xml format;
for each obtained picture, establishing a replacement path for the picture in the document to be analyzed according to the position of the picture in the document to be processed and the storage address of the picture in the remote server;
analyzing the document to be analyzed, and determining the xml tag corresponding to each media element in the document to be analyzed, wherein the xml tags at least comprise a formula tag used for representing a formula and a picture tag used for representing a picture;
and converting the document to be analyzed into a display document in a target format based on each determined xml tag and the replacement path corresponding to each picture, wherein the target format is a document format that can be displayed to a user in a webpage.
Optionally, the obtaining, from the document to be processed, the picture and the document to be analyzed included in the document to be processed includes:
decompressing the document to be processed by using the decompression tool ZipArchive to obtain the pictures contained in the document to be processed and the document to be analyzed.
Optionally, the storing each acquired picture in a remote server includes:
determining the insertion sequence of the picture in the document to be processed according to the position of the picture in the document to be processed;
and taking the insertion sequence of the picture in the document to be processed as the storage file name of the picture, and storing the storage file name into the remote server.
Optionally, the establishing a replacement path for the picture in the document to be analyzed according to the position of the picture in the document to be processed and the storage address of the picture in the remote server includes:
searching a storage file name matched with the insertion sequence from the storage file names of the pictures stored in the remote server by using the insertion sequence of the picture in the document to be processed as a target file name;
extracting a storage address where the picture of the target file name is located from the remote server as a target storage address;
and taking the extracted target storage address as a corresponding replacement path of the picture in the document to be analyzed, wherein the replacement path is used for loading the picture in a webpage in a mode of accessing a remote picture.
Optionally, when the determined xml tag is the picture tag, the converting the document to be analyzed into a display document in a target format based on each determined xml tag and the replacement path corresponding to each picture includes:
acquiring the replacement path corresponding to a target picture aiming at each picture tag, wherein the target picture is a picture marked by the picture tag in the document to be analyzed;
searching a replacement picture of the target picture from the remote server by using a replacement path corresponding to the target picture in a remote access mode, wherein the replacement picture is the target picture which can be loaded in a webpage;
and displaying the searched replacement picture to a user at a first target vacancy in the display document, wherein the first target vacancy is a position in the display document where the target picture needs to be inserted.
Optionally, when the determined xml tag is the formula tag, the converting the document to be analyzed into a display document in a target format based on each determined xml tag and the replacement path corresponding to each picture includes:
for each formula tag, obtaining the formula data line marked by the formula tag from the document to be analyzed, wherein the formula data line is the arrangement of numbers, letters and operation symbols that make up a target formula, and the target formula is the formula marked by the formula tag in the document to be analyzed;
marking each character included in the formula data line by using region interval marks, and determining a word segmentation marking result of the formula data line, wherein a region interval mark is a sub-tag used for identifying different types of characters within the formula tag;
adjusting the display position of each character in the formula data line by using a cascading style sheet according to the format of the target formula, to obtain a formula to be imported for display in a webpage;
and displaying the formula to be imported to a user at a second target vacancy in the display document, wherein the second target vacancy is the position in the display document where the target formula needs to be inserted.
Optionally, the determining the xml tag corresponding to each media element in the document to be parsed further includes:
judging whether the document to be analyzed contains a target media element, wherein the target media element is as follows: a formula inserted in a picture format;
and if the document to be analyzed contains the target media element, taking the picture tag as an xml tag corresponding to the target media element.
In a second aspect, an embodiment of the present application provides a document processing apparatus, including:
a data acquisition module, configured to obtain, from a document to be processed, the pictures it contains and a document to be analyzed, and to store each obtained picture in a remote server, wherein the document to be processed is a Word document containing multiple media elements of different types, and the document to be analyzed is the document to be processed in xml format;
a resource replacing module, configured to establish, for each obtained picture, a replacement path for the picture in the document to be analyzed according to the position of the picture in the document to be processed and the storage address of the picture in the remote server;
a document analysis module, configured to analyze the document to be analyzed and determine the xml tag corresponding to each media element in the document to be analyzed, wherein the xml tags at least comprise a formula tag used for representing a formula and a picture tag used for representing a picture;
and a document conversion module, configured to convert the document to be analyzed into a display document in a target format based on each determined xml tag and the replacement path corresponding to each picture, wherein the target format is a document format that can be displayed to a user in a webpage.
Optionally, the data obtaining module is further configured to:
decompressing the document to be processed by using the decompression tool ZipArchive to obtain the pictures contained in the document to be processed and the document to be analyzed.
Optionally, the data obtaining module is further configured to:
determining the insertion sequence of the picture in the document to be processed according to the position of the picture in the document to be processed;
and taking the insertion sequence of the picture in the document to be processed as the storage file name of the picture, and storing the storage file name into the remote server.
Optionally, the resource replacing module is further configured to:
searching a storage file name matched with the insertion sequence from the storage file names of the pictures stored in the remote server by using the insertion sequence of the picture in the document to be processed as a target file name;
extracting a storage address where the picture of the target file name is located from the remote server as a target storage address;
and taking the extracted target storage address as a corresponding replacement path of the picture in the document to be analyzed, wherein the replacement path is used for loading the picture in a webpage in a mode of accessing a remote picture.
Optionally, when the determined xml tag is the picture tag, the document conversion module is further configured to:
acquiring the replacement path corresponding to a target picture aiming at each picture tag, wherein the target picture is a picture marked by the picture tag in the document to be analyzed;
searching a replacement picture of the target picture from the remote server by using a replacement path corresponding to the target picture in a remote access mode, wherein the replacement picture is the target picture which can be loaded in a webpage;
and displaying the searched replacement picture to a user at a first target vacancy in the display document, wherein the first target vacancy is a position in the display document where the target picture needs to be inserted.
Optionally, when the determined xml tag is the formula tag, the document conversion module is further configured to:
for each formula tag, obtaining the formula data line marked by the formula tag from the document to be analyzed, wherein the formula data line is the arrangement of numbers, letters and operation symbols that make up a target formula, and the target formula is the formula marked by the formula tag in the document to be analyzed;
marking each character included in the formula data line by using region interval marks, and determining a word segmentation marking result of the formula data line, wherein a region interval mark is a sub-tag used for identifying different types of characters within the formula tag;
adjusting the display position of each character in the formula data line by using a cascading style sheet according to the format of the target formula, to obtain a formula to be imported for display in a webpage;
and displaying the formula to be imported to a user at a second target vacancy in the display document, wherein the second target vacancy is the position in the display document where the target formula needs to be inserted.
Optionally, the document parsing module is further configured to:
judging whether the document to be analyzed contains a target media element, wherein the target media element is as follows: a formula inserted in a picture format;
and if the document to be analyzed contains the target media element, taking the picture tag as an xml tag corresponding to the target media element.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the document processing method when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the document processing method.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
A document to be processed in Word format, containing multiple media elements of different types, is converted into a document to be analyzed in xml format, and the pictures contained in the document to be processed are stored in a remote server. Because the document to be analyzed is in xml format, media elements in it such as pictures and formulas cannot be displayed directly in a webpage. On the one hand, for media elements of the picture type, the method establishes, for each picture contained in the document to be processed, a replacement path for the picture in the document to be analyzed according to the position of the picture in the document to be processed and the storage address of the picture in the remote server; the media element can then be displayed normally in the webpage by accessing the remote picture through its replacement path.
On the other hand, after the document to be analyzed has been analyzed, the xml tag determined for each media element can be used when converting the document to be analyzed into a display document that can be displayed to a user in a webpage: a media element corresponding to the formula tag is identified as a formula, which solves the technical problem that the COM component in the prior art cannot recognize formulas in a topic document. The document processing method provided by the application can therefore process the pictures and the formulas in a topic document separately, converting each into a format that can be displayed to users in a webpage, so that document processing efficiency is effectively improved while the requirements of online question bank entry are met.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart illustrating a document processing method provided by an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for processing a picture in a document to be parsed according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for processing a formula in a document to be parsed according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a document processing apparatus provided in an embodiment of the present application;
FIG. 5 shows a schematic structural diagram of a computer device 500 according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a document processing method, a document processing device, document processing equipment and a storage medium, which are described below through embodiments.
Example one
FIG. 1 is a flowchart illustrating a document processing method provided by an embodiment of the present application, wherein the method includes steps S101-S104; specifically, the method comprises the following steps:
S101, obtaining, from the document to be processed, the pictures it contains and the document to be analyzed, and storing each obtained picture in a remote server.
Specifically, the document to be processed is a word document containing a plurality of media elements of different types, and the document to be analyzed is the document to be processed in an xml format. The media elements can be pictures, formulas, words, tables and other elements used for showing document information to users.
By way of example, a topic document created by a user for a mathematics subject may be used as the document to be processed, where the document to be processed includes at least the text, tables, pictures and mathematical formulas appearing in the mathematics exercises. The document to be processed may be a Word document with the .docx extension; converting its format from docx to xml yields the document to be analyzed corresponding to the document to be processed.
In this embodiment, as an optional embodiment, the obtaining, from a document to be processed, a picture and a document to be parsed included in the document to be processed includes:
decompressing the document to be processed by using the decompression tool ZipArchive to obtain the pictures contained in the document to be processed and the document to be analyzed.
Illustratively, still taking the document to be processed in the above example, decompressing it with the decompression tool ZipArchive yields the document to be analyzed in xml format and a media picture resource folder, where the media picture resource folder consists of the pictures included in the document to be processed.
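The decompression step can be sketched as follows. This is a minimal illustration rather than the patented implementation: it uses Python's standard zipfile module as a stand-in for the ZipArchive tool named above, and it assumes the archive layout that Word typically writes (main body at word/document.xml, pictures under word/media/). The function names extract_docx_parts and make_demo_docx are hypothetical.

```python
import io
import zipfile


def extract_docx_parts(docx_bytes):
    """Return (document_xml, pictures) read from a .docx archive.

    Assumes the usual Word layout: word/document.xml plus word/media/*.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        document_xml = archive.read("word/document.xml")
        pictures = {
            name.split("/")[-1]: archive.read(name)
            for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        }
    return document_xml, pictures


def make_demo_docx():
    """Build a tiny stand-in archive so the sketch runs without a real file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("word/media/image1.png", b"\x89PNG-demo")
    return buffer.getvalue()


document_xml, pictures = extract_docx_parts(make_demo_docx())
```

Here document_xml plays the role of the document to be analyzed, and pictures plays the role of the media picture resource folder.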
Specifically, in this embodiment of the present application, as an optional embodiment, the storing each acquired picture in a remote server includes:
determining the insertion sequence of the picture in the document to be processed according to the position of the picture in the document to be processed;
and taking the insertion sequence of the picture in the document to be processed as the storage file name of the picture, and storing the storage file name into the remote server.
Illustratively, suppose a document to be processed contains 10 mathematical topics, where picture a appears in the first mathematical topic, picture b in the fifth, and picture c in the eighth. The insertion order of picture a in the document to be processed is then determined to be fig. 1, that of picture b to be fig. 2, and that of picture c to be fig. 3. After pictures a, b and c are obtained from the document to be processed, they are stored in the remote server with fig. 1 as the storage file name of picture a, fig. 2 as the storage file name of picture b, and fig. 3 as the storage file name of picture c.
It should be noted that, when processing a plurality of documents to be processed, the remote server establishes a separate picture folder for each document to be processed according to the document name of each document to be processed, so as to store and distinguish pictures contained in different documents to be processed.
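A minimal sketch of this storage scheme, under stated assumptions: a local temporary directory stands in for the remote server, the folder is named after the document, and each picture is saved under a name derived from its insertion order. The function store_pictures, the "fig" name prefix and the document name math_questions are illustrative choices, not taken from the patent.

```python
import tempfile
from pathlib import Path


def store_pictures(server_root, document_name, pictures_in_order):
    """Save pictures under <server_root>/<document_name>/fig<N>.

    pictures_in_order is a list of picture bytes in insertion order;
    the returned dict maps insertion order (1-based) to the stored path.
    """
    folder = Path(server_root) / document_name  # one folder per document
    folder.mkdir(parents=True, exist_ok=True)
    stored = {}
    for order, data in enumerate(pictures_in_order, start=1):
        target = folder / f"fig{order}"
        target.write_bytes(data)
        stored[order] = target
    return stored


# Demo: the temporary directory is a hypothetical stand-in for the server.
with tempfile.TemporaryDirectory() as root:
    stored = store_pictures(root, "math_questions", [b"a", b"b", b"c"])
    stored_names = {order: path.name for order, path in stored.items()}
    first_bytes = stored[1].read_bytes()
```

Keeping a separate folder per document means two documents can each have a first picture without their file names colliding.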
S102, aiming at each acquired picture, establishing a corresponding replacement path of the picture in the document to be analyzed according to the position of the picture in the document to be processed and the storage address of the picture in the remote server.
Specifically, in this embodiment, as an optional embodiment, the establishing, according to the position of the picture in the document to be processed and the storage address of the picture in the remote server, a replacement path for the picture in the document to be analyzed includes:
searching a storage file name matched with the insertion sequence from the storage file names of the pictures stored in the remote server by using the insertion sequence of the picture in the document to be processed as a target file name;
extracting a storage address where the picture of the target file name is located from the remote server as a target storage address;
and taking the extracted target storage address as a corresponding replacement path of the picture in the document to be analyzed, wherein the replacement path is used for loading the picture in a webpage in a mode of accessing a remote picture.
Illustratively, taking picture a in the above example, whose insertion order in the document to be processed is fig. 1: a picture whose storage file name is fig. 1 is searched for in the picture folder of the document to be processed stored in the remote server. If the storage address of the picture found in the remote server is www/media/fig. 1 (that is, fig. 1 stored under the media folder in the remote server), this storage address is taken as the replacement path of picture a at its position in the first mathematical topic of the document to be analyzed. Picture a in the first mathematical topic can then be shown to the user in the final online question bank webpage by accessing the remote picture, where the address for accessing the remote picture is http://domain name/media/fig. 1.
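The lookup from insertion order to replacement path can be sketched as below. This is an illustration only: the base URL http://example.com, the /media/ path segment, the file names and both function names are assumptions made for the example.

```python
import html


def build_replacement_paths(stored_names, base_url):
    """Map each insertion order to the URL of the stored picture."""
    root = base_url.rstrip("/")
    return {
        order: f"{root}/media/{name}"
        for order, name in stored_names.items()
    }


def picture_markup(replacement_path):
    """Render an img element that loads the picture by remote access."""
    return f'<img src="{html.escape(replacement_path, quote=True)}">'


paths = build_replacement_paths(
    {1: "fig1", 2: "fig2", 3: "fig3"}, "http://example.com/"
)
first_picture_html = picture_markup(paths[1])
```

Because the browser fetches the picture from the stored address, the display document only needs to carry the path, not the picture data itself.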
S103, analyzing the document to be analyzed, and determining the xml tags corresponding to the media elements in the document to be analyzed.
Specifically, as an optional embodiment, the document to be analyzed in xml format may be read by using DOMDocument in the PHP scripting language, which provides the initial (top-most) entry point into the document to be analyzed. Each media element in the document to be analyzed is read as a node, and the xml tag corresponding to each media element is then resolved from the attribute information of the element read.
Specifically, the xml tag at least comprises: formula tags for characterizing formulas and picture tags for characterizing pictures.
For example, when a document to be analyzed containing media elements of different types, such as text, pictures, tables and formulas, is analyzed: if the xml tag corresponding to media element x is an <oMath> tag, media element x is determined to be a formula; if the xml tag corresponding to media element y is an <imagedata> tag, media element y is determined to be a picture; if the xml tag corresponding to media element z is a <tbl> tag, media element z is determined to be a table. Here the <oMath> tag corresponds to the formula tag, and the <imagedata> tag corresponds to the picture tag.
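The tag-based classification above can be sketched as follows. The sketch uses Python's xml.etree.ElementTree as a stand-in for the DOMDocument reader mentioned earlier and matches elements by local tag name with the namespace stripped. The sample markup is hand-written and far simpler than what Word emits; the names TAG_KINDS, local_name and classify_media_elements are hypothetical.

```python
import xml.etree.ElementTree as ET

# Local tag name -> media element type, following the example above.
TAG_KINDS = {"oMath": "formula", "imagedata": "picture", "tbl": "table"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def classify_media_elements(document_xml):
    """Return the media element types found, in document order."""
    root = ET.fromstring(document_xml)
    return [
        TAG_KINDS[local_name(element.tag)]
        for element in root.iter()
        if local_name(element.tag) in TAG_KINDS
    ]


SAMPLE = (
    "<w:document"
    ' xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
    ' xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"'
    ' xmlns:v="urn:schemas-microsoft-com:vml">'
    "<w:body>"
    "<w:p><m:oMath><m:r><m:t>x+1</m:t></m:r></m:oMath></w:p>"
    "<w:p><w:pict><v:shape><v:imagedata/></v:shape></w:pict></w:p>"
    "<w:tbl/>"
    "</w:body></w:document>"
)
kinds = classify_media_elements(SAMPLE)
```

A formula inserted as a picture would appear under the picture tag in such a scan, which is consistent with treating it as a picture.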
It should be noted that a formula may take more than one form in a document to be processed created by a user: for example, the user may insert an editable formula through the formula editor, or may insert a formula in picture format. Therefore, in this embodiment of the present application, as an optional embodiment, the determining an xml tag corresponding to each media element in the document to be analyzed further includes:
judging whether the document to be analyzed contains a target media element, wherein the target media element is as follows: a formula inserted in a picture format;
and if the document to be analyzed contains the target media element, taking the picture tag as an xml tag corresponding to the target media element.
Illustratively, if it is determined that a formula inserted in a picture format exists in the document to be parsed, the formula in the picture format is taken as a target media element, and the target media element is processed in a picture processing manner.
S104, converting the document to be analyzed into a display document in a target format based on each determined xml tag and the replacement path corresponding to each picture.
Specifically, the target format refers to a document format that can be presented to a user in a web page.
It should be noted that the document to be parsed is in xml format, so media elements such as pictures and formulas in it cannot be displayed directly in a web page. The determined xml tags are therefore used to identify the type of each media element in the document to be parsed. A media element of the picture type can be displayed normally on the web page by accessing a remote picture through the replacement path corresponding to that element. For a media element of the formula type, the MathJax JavaScript library (a front-end tool) can be used to convert the <oMath> tag (the formula tag) into div tags (division tags) and span tags (inline tags), both of which are common in html web documents. A div tag delimits a block of data and separates it into independent element parts; a span tag is an inline tag of the hypertext markup language that is mostly used to group inline elements in a document. After the tags are converted, the display position of each number, letter, or operation symbol in the formula can be adjusted with a style sheet, so that the formula is displayed to the user in the web page with the correct layout.
In a possible implementation, when the determined xml tag is the picture tag, the pictures in the document to be parsed are processed as shown in Fig. 2, which is a flowchart of a method for processing a picture in a document to be parsed provided by an embodiment of the present application. As shown in Fig. 2, executing step S104 further includes S201-S203, specifically:
S201, for each picture tag, obtaining the replacement path corresponding to the target picture.
Specifically, the target picture is a picture marked by the picture tag in the document to be parsed.
Continuing the example of step S101, the document to be processed contains 10 mathematical topics in total, where picture a appears in the first mathematical topic, picture b in the fifth, and picture c in the eighth. Because the document to be analyzed is simply the document to be processed in xml format, the positions of pictures a, b, and c in the document to be analyzed are the same as their positions in the document to be processed. After the document to be analyzed is parsed, the xml tags corresponding to pictures a, b, and c are all picture tags. The storage address of picture a is obtained from the remote server as the replacement path of picture a and is denoted Xa; the storage address of picture b is obtained as the replacement path of picture b and is denoted Xb; and the storage address of picture c is obtained as the replacement path of picture c and is denoted Xc.
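As a rough illustration of how replacement paths such as Xa, Xb, and Xc could be built, the following Python sketch names each picture by its insertion order and joins that name to a server address. The base address, the .png extension, and the function name are assumptions made for the example only; the description does not specify them.

```python
from typing import Dict, List


def build_replacement_paths(pictures_in_order: List[str], base_url: str) -> Dict[str, str]:
    """Map each picture to a remote address named after its 1-based insertion order."""
    return {
        name: f"{base_url}/{index}.png"  # ".png" is an assumed extension
        for index, name in enumerate(pictures_in_order, start=1)
    }


# Pictures a, b, c listed in the order in which they appear in the document.
paths = build_replacement_paths(["a", "b", "c"], "https://example.invalid/pictures")
print(paths["a"])  # plays the role of Xa
print(paths["c"])  # plays the role of Xc
```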
S202, searching for the replacement picture of the target picture from the remote server in a remote access mode by using the replacement path corresponding to the target picture.
Specifically, the replacement picture is the target picture that can be loaded in a web page.
Taking picture a as an example: picture a is the first picture inserted in the document to be parsed, its storage address in the remote server (i.e., the replacement path of picture a) is denoted Xa, and its storage file name in the remote server is "picture 1". Using the replacement path of picture a, the replacement picture of picture a, picture 1, can be found on the remote server by remote access.
S203, displaying the searched replacement picture to the user at the first target vacancy in the display document.
Specifically, the first target vacancy is a position in the display document where the target picture needs to be inserted.
For example, still taking picture a from the above example: because picture a is located in the first mathematical topic of the document to be parsed, the first target vacancy of picture a in the display document is also in the first mathematical topic. When the document to be parsed is converted into the display document, the replacement picture of picture a, picture 1, is found on the remote server by remote access using the replacement path of picture a, inserted into the first mathematical topic of the display document, and displayed to the user as a web page, so that the user can conveniently answer online in the web page.
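Steps S201-S203 can be pictured with the short Python sketch below, which fills a picture vacancy in a display document with an img element that loads the picture from its replacement path. The "{{picture:...}}" placeholder syntax, the function name, and the example address are all hypothetical; the description does not say how a target vacancy is represented.

```python
from html import escape
from typing import Dict


def fill_picture_slots(display_html: str, replacement_paths: Dict[str, str]) -> str:
    """Replace each hypothetical {{picture:ID}} vacancy with a remote <img> element."""
    for picture_id, src in replacement_paths.items():
        slot = "{{picture:" + picture_id + "}}"
        img = '<img src="{}" alt="picture {}">'.format(
            escape(src, quote=True), escape(picture_id, quote=True)
        )
        display_html = display_html.replace(slot, img)
    return display_html


page = fill_picture_slots(
    "<div>Topic 1: {{picture:a}}</div>",
    {"a": "https://example.invalid/pictures/1.png"},
)
print(page)
```

Escaping the address before it is written into the attribute keeps the generated markup well-formed even if a storage address contains quotation marks or ampersands.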
In a possible implementation, when the determined xml tag is the formula tag, the formulas in the document to be parsed are processed as shown in Fig. 3, which is a flowchart of a method for processing a formula in a document to be parsed provided by an embodiment of the present application. As shown in Fig. 3, executing step S104 further includes S301-S304, specifically:
S301, for each formula tag, obtaining the formula data line marked by the formula tag from the document to be parsed.
Specifically, the formula data line is the arrangement of the numbers, letters, and operation symbols that constitute a target formula, and the target formula is the formula marked by the formula tag in the document to be parsed. For example, the <oMath> tag is used as the formula tag in the document to be parsed, and the target formula marked by the <oMath> tag is $y = \frac{1}{2} \times m + b$. The formula data line corresponding to the target formula is then: y, =, $\frac{1}{2}$, ×, m, +, b.
S302, marking each character included in the formula data line with region interval marks, and determining a word segmentation marking result of the formula data line, wherein the region interval marks are the sub-tags in the formula tag that identify different types of characters.
Specifically, as an optional embodiment, the <oMath> tag may be used as the formula tag to mark the media elements that are formulas in the document to be parsed. The <oMath> tag contains several sub-tags for identifying different types of characters in a formula, for example a <mi> sub-tag for marking letters, a <mn> sub-tag for marking numbers, and a <mo> sub-tag for marking operation symbols. Using these sub-tags as the region interval marks, each character in the formula data line can be marked, which gives the word segmentation marking result of the formula data line.
For example, taking the target formula $y = \frac{1}{2} \times m + b$ from the above example, marking each character in the formula data line with the sub-tags of the <oMath> tag gives: <mi>y</mi> <mo>=</mo> <mn>1</mn> <mo>-</mo> <mn>2</mn> <mo>×</mo> <mi>m</mi> <mo>+</mo> <mi>b</mi>, where the </mi> sub-tag marks the end of the character range marked by the <mi> sub-tag, the </mo> sub-tag marks the end of the character range marked by the <mo> sub-tag, and the </mn> sub-tag marks the end of the character range marked by the <mn> sub-tag. The formula data line is thus divided into independent characters, and its word segmentation marking result is: y, =, 1, -, 2, ×, m, +, b.
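The word segmentation of step S302 can be reproduced with a few lines of Python. The sketch below is illustrative only: it assumes the formula is available as the marked-up string given in the example, with the <mi>, <mn>, and <mo> sub-tags as direct children of <oMath> and without namespaces, and the names CHAR_TAGS and segment_formula are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Sub-tags named in the description and the character type each one marks.
CHAR_TAGS = {"mi": "letter", "mn": "number", "mo": "operator"}


def segment_formula(formula_xml: str) -> List[Tuple[str, str]]:
    """Return (character type, character) pairs in the order they appear."""
    root = ET.fromstring(formula_xml)
    return [
        (CHAR_TAGS[element.tag], element.text or "")
        for element in root.iter()
        if element.tag in CHAR_TAGS
    ]


formula = (
    "<oMath><mi>y</mi><mo>=</mo><mn>1</mn><mo>-</mo><mn>2</mn>"
    "<mo>\u00d7</mo><mi>m</mi><mo>+</mo><mi>b</mi></oMath>"
)
tokens = segment_formula(formula)
print([character for _, character in tokens])
```

Keeping the character type alongside each character preserves the information that the later styling step needs to decide how a character is displayed.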
S303, adjusting the display position of each character in the formula data line according to the format of the target formula by using a cascading style sheet, to obtain a formula to be imported for display in a web page.
Specifically, taking the sub-tags of the <oMath> tag (i.e., the sub-tags in the formula tag) as the region interval marks, once the word segmentation marking result of the formula data line is obtained, a css style sheet (cascading style sheet) can be used to set precisely the display position of each independent character of the result when it is displayed in the web page.
For example, taking the character "y" from the above example: the <mi> sub-tag for marking letters identifies the independent letter y. Each character marked by a sub-tag is converted into a well-defined tag carrying a specific class attribute, so that the character y can be displayed in the web page through that class attribute in combination with the css style sheet. An example of the tag conversion is as follows:
<mi>y</mi> is converted to:
<mjx-mi class="mjx-n">
<mjx-c class="mjx-c79">
</mjx-c>
</mjx-mi>
where c79 represents the character y (79 is the hexadecimal code point of "y").
For the characters "1", "-", and "2" in the above example, their format in the target formula is the fraction $\frac{1}{2}$. The recognized characters "1", "-", and "2" can therefore be displayed in the web page as $\frac{1}{2}$ through the hierarchical relationship between the tags, combined with the css style sheet. A specific processing example is as follows:
<mjx-table space=4>
<mjx-row size="s"></mjx-row> (numerator 1)
<mjx-line></mjx-line> (horizontal line)
<mjx-row size="s"></mjx-row> (denominator 2)
</mjx-table>
Here the space attribute defines the width of each character and the size attribute defines the size at which each character is displayed in the web page: the character "1" as the numerator and the character "2" as the denominator are displayed at size s, slightly smaller than the other characters. The characters "1", "-", and "2" in the formula data line are thus adjusted according to their format in the target formula, which gives the formula to be imported that is finally displayed in the web page, $\frac{1}{2}$.
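The two conversions in the example of step S303 can be mimicked with the Python sketch below. It only reproduces the markup shown in the example and is not a description of how MathJax works internally; the function names are hypothetical, and placing the numerator and denominator inside the mjx-row elements is an assumption, since the listing above shows the rows empty.

```python
def char_to_mjx(character: str) -> str:
    """Wrap one character in tags whose class encodes its hexadecimal code point."""
    code = format(ord(character), "X")  # "y" -> "79"
    return (
        '<mjx-mi class="mjx-n">'
        f'<mjx-c class="mjx-c{code}"></mjx-c>'
        "</mjx-mi>"
    )


def fraction_to_mjx(numerator: str, denominator: str) -> str:
    """Build the numerator / horizontal line / denominator hierarchy of the example."""
    return (
        '<mjx-table space="4">'
        f'<mjx-row size="s">{numerator}</mjx-row>'   # assumed: content inside the row
        "<mjx-line></mjx-line>"
        f'<mjx-row size="s">{denominator}</mjx-row>'
        "</mjx-table>"
    )


print(char_to_mjx("y"))
print(fraction_to_mjx("1", "2"))
```

Because the class name is derived from the code point, a single style sheet rule per character is enough to decide what is drawn, while the nesting of the table, row, and line elements carries the layout.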
S304, displaying the formula to be imported to the user at a second target vacancy in the display document.
Specifically, the second target vacancy is a position in the display document where the target formula needs to be inserted.
For example, if the target formula is located in the second mathematical topic of the document to be analyzed, the second target vacancy is in the second mathematical topic of the display document, and the formula to be imported obtained by adjusting the target formula in step S303 is imported into the second mathematical topic of the display document and displayed to the user, so that the user can conveniently answer online in the web page.
Example two
Fig. 4 is a schematic structural diagram of a document processing apparatus provided in an embodiment of the present application, where the apparatus includes:
a data obtaining module 401, configured to obtain, from a to-be-processed document, a picture and a to-be-analyzed document that are included in the to-be-processed document, and store each obtained picture in a remote server, where the to-be-processed document is a word document that includes multiple different types of media elements, and the to-be-analyzed document is the to-be-processed document in an xml format;
a resource replacement module 402, configured to, for each obtained picture, establish a corresponding replacement path of the picture in the document to be analyzed according to a location of the picture in the document to be processed and a storage address of the picture in the remote server;
a document parsing module 403, configured to parse the document to be parsed, and determine an xml tag corresponding to each media element in the document to be parsed, where the xml tag at least includes: the formula label is used for representing a formula and the picture label is used for representing a picture;
a document conversion module 404, configured to convert, based on each determined xml tag and the replacement path corresponding to each picture, the document to be parsed into a display document in a target format, where the target format is a document format that can be presented to a user in a webpage.
Optionally, the data obtaining module 401 is further configured to:
decompressing the document to be processed by using the decompression tool ZipArchive to obtain the pictures contained in the document to be processed and the document to be analyzed.
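The decompression step can be illustrated as follows. The description names the ZipArchive tool; this sketch instead uses Python's standard zipfile module as a stand-in, and it assumes the usual layout of a .docx package, in which the main xml part is word/document.xml and embedded pictures are stored under word/media/. The function name is hypothetical.

```python
import io
import zipfile
from typing import Dict, Tuple


def extract_docx_parts(docx_bytes: bytes) -> Tuple[bytes, Dict[str, bytes]]:
    """Return the main xml part and every embedded picture of a .docx package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        document_xml = archive.read("word/document.xml")
        pictures = {
            name: archive.read(name)
            for name in archive.namelist()
            if name.startswith("word/media/")
        }
    return document_xml, pictures


# Build a tiny in-memory package so that the sketch runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", "<document/>")
    package.writestr("word/media/image1.png", b"placeholder image bytes")

xml_part, picture_parts = extract_docx_parts(buffer.getvalue())
print(sorted(picture_parts))
```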
Optionally, the data obtaining module 401 is further configured to:
determining the insertion sequence of the picture in the document to be processed according to the position of the picture in the document to be processed;
and storing the picture in the remote server with its insertion sequence in the document to be processed as its storage file name.
Optionally, the resource replacing module 402 is further configured to:
searching a storage file name matched with the insertion sequence from the storage file names of the pictures stored in the remote server by using the insertion sequence of the picture in the document to be processed as a target file name;
extracting a storage address where the picture of the target file name is located from the remote server as a target storage address;
and taking the extracted target storage address as a corresponding replacement path of the picture in the document to be analyzed, wherein the replacement path is used for loading the picture in a webpage in a mode of accessing a remote picture.
Optionally, when the determined xml tag is the picture tag, the document conversion module 404 is further configured to:
acquiring the replacement path corresponding to a target picture aiming at each picture tag, wherein the target picture is a picture marked by the picture tag in the document to be analyzed;
searching a replacement picture of the target picture from the remote server by using a replacement path corresponding to the target picture in a remote access mode, wherein the replacement picture is the target picture which can be loaded in a webpage;
and displaying the searched replacement picture to a user at a first target vacancy in the display document, wherein the first target vacancy is a position in the display document where the target picture needs to be inserted.
Optionally, when the determined xml tag is the formula tag, the document conversion module 404 is further configured to:
for each formula label, obtaining a formula data line marked by the formula label from the document to be analyzed, wherein the formula data line is a permutation and combination of numbers, letters and operation symbols forming a target formula, and the target formula is a formula marked by the formula label in the document to be analyzed correspondingly;
marking each character included in the formula data line by using a region interval mark, and determining a word segmentation marking result of the formula data line, wherein the region interval mark is a sub-label used for identifying different types of characters in the formula label;
adjusting the display position of each character in the formula data line by utilizing a cascading style sheet according to the format of the target formula to obtain a formula to be imported for displaying in a webpage;
and displaying the formula to be imported to a user at a second target vacancy in the display document, wherein the second target vacancy is a position in the display document where the target formula needs to be inserted.
Optionally, the document parsing module 403 is further configured to:
judging whether the document to be analyzed contains a target media element, wherein the target media element is as follows: a formula inserted in a picture format;
and if the document to be analyzed contains the target media element, taking the picture tag as an xml tag corresponding to the target media element.
Example three
As shown in fig. 5, an embodiment of the present application provides a computer device 500 for executing the document processing method in the present application, the device includes a memory 501, a processor 502 and a computer program stored on the memory 501 and executable on the processor 502, wherein the processor 502 implements the steps of the document processing method when executing the computer program.
Specifically, the memory 501 and the processor 502 may be a general-purpose memory and a general-purpose processor, which are not specifically limited here; the document processing method is performed when the processor 502 executes the computer program stored in the memory 501.
Corresponding to the document processing method in the present application, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, performs the steps of the document processing method described above.
Specifically, the storage medium can be a general-purpose storage medium, such as a removable disk or a hard disk, and the above document processing method is performed when the computer program on the storage medium is executed.
In the embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described system embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and there may be other divisions in actual implementation, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of systems or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the technical field can, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of document processing, the method comprising:
obtaining, from a document to be processed, the pictures and the document to be analyzed that are contained in the document to be processed, and storing each obtained picture in a remote server, wherein the document to be processed is a word document containing a plurality of media elements of different types, and the document to be analyzed is the document to be processed in an xml format;
aiming at each acquired picture, establishing a corresponding replacement path of the picture in the document to be analyzed according to the position of the picture in the document to be processed and the storage address of the picture in the remote server;
analyzing the document to be analyzed, and determining an xml tag corresponding to each media element in the document to be analyzed, wherein the xml tag at least comprises: the formula label is used for representing a formula and the picture label is used for representing a picture;
and converting the document to be analyzed into a display document in a target format based on each determined xml tag and the replacement path corresponding to each picture, wherein the target format is a document format capable of being displayed to a user in a webpage.
2. The method according to claim 1, wherein the obtaining of the picture and the document to be parsed included in the document to be processed from the document to be processed comprises:
decompressing the document to be processed by using the decompression tool ZipArchive to obtain the pictures contained in the document to be processed and the document to be analyzed.
3. The method of claim 1, wherein storing each of the obtained pictures in a remote server comprises:
determining the insertion sequence of the picture in the document to be processed according to the position of the picture in the document to be processed;
and storing the picture in the remote server with its insertion sequence in the document to be processed as its storage file name.
4. The method according to claim 3, wherein the establishing a corresponding replacement path of the picture in the document to be parsed according to the position of the picture in the document to be processed and the storage address of the picture in the remote server comprises:
searching a storage file name matched with the insertion sequence from the storage file names of the pictures stored in the remote server by using the insertion sequence of the picture in the document to be processed as a target file name;
extracting a storage address where the picture of the target file name is located from the remote server as a target storage address;
and taking the extracted target storage address as a corresponding replacement path of the picture in the document to be analyzed, wherein the replacement path is used for loading the picture in a webpage in a mode of accessing a remote picture.
5. The method according to claim 1, wherein when the determined xml tag is the picture tag, the converting the document to be parsed into a display document in a target format based on each determined xml tag and the replacement path corresponding to each picture comprises:
acquiring the replacement path corresponding to a target picture aiming at each picture tag, wherein the target picture is a picture marked by the picture tag in the document to be analyzed;
searching a replacement picture of the target picture from the remote server by using a replacement path corresponding to the target picture in a remote access mode, wherein the replacement picture is the target picture which can be loaded in a webpage;
and displaying the searched replacement picture to a user at a first target vacancy in the display document, wherein the first target vacancy is a position in the display document where the target picture needs to be inserted.
6. The method according to claim 1, wherein when the determined xml tag is the formula tag, the converting the document to be parsed into a display document in a target format based on each determined xml tag and the replacement path corresponding to each picture comprises:
for each formula label, obtaining a formula data line marked by the formula label from the document to be analyzed, wherein the formula data line is a permutation and combination of numbers, letters and operation symbols forming a target formula, and the target formula is a formula marked by the formula label in the document to be analyzed correspondingly;
marking each character included in the formula data line by using a region interval mark, and determining a word segmentation marking result of the formula data line, wherein the region interval mark is a sub-label used for identifying different types of characters in the formula label;
adjusting the display position of each character in the formula data line by utilizing a cascading style sheet according to the format of the target formula to obtain a formula to be imported for displaying in a webpage;
and displaying the formula to be imported to a user at a second target vacancy in the display document, wherein the second target vacancy is a position in the display document where the target formula needs to be inserted.
7. The method according to claim 1, wherein the determining an xml tag corresponding to each media element in the document to be parsed further comprises:
judging whether the document to be analyzed contains a target media element, wherein the target media element is as follows: a formula inserted in a picture format;
and if the document to be analyzed contains the target media element, taking the picture tag as an xml tag corresponding to the target media element.
8. A document processing apparatus, characterized in that the apparatus comprises:
the data acquisition module is used for acquiring, from a document to be processed, the pictures and the document to be analyzed that are contained in the document to be processed, and storing each acquired picture in a remote server, wherein the document to be processed is a word document containing a plurality of media elements of different types, and the document to be analyzed is the document to be processed in an xml format;
the resource replacing module is used for establishing a corresponding replacing path of each acquired picture in the document to be analyzed according to the position of the picture in the document to be processed and the storage address of the picture in the remote server;
a document analysis module, configured to analyze the document to be analyzed, and determine an xml tag corresponding to each media element in the document to be analyzed, where the xml tag at least includes: the formula label is used for representing a formula and the picture label is used for representing a picture;
and the document conversion module is used for converting the document to be analyzed into a display document in a target format based on each determined xml tag and the replacement path corresponding to each picture, wherein the target format is a document format which can be displayed to a user in a webpage.
9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the document processing method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, is adapted to carry out the steps of the document processing method according to any one of claims 1 to 7.
CN202110359276.9A 2021-04-02 2021-04-02 Document processing method, device, equipment and storage medium Active CN112733056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110359276.9A CN112733056B (en) 2021-04-02 2021-04-02 Document processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112733056A true CN112733056A (en) 2021-04-30
CN112733056B CN112733056B (en) 2021-06-18

Family

ID=75596330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110359276.9A Active CN112733056B (en) 2021-04-02 2021-04-02 Document processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112733056B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117149703A (en) * 2023-09-04 2023-12-01 上海易立德信息技术股份有限公司 File processing method and system
CN117275651A (en) * 2023-09-01 2023-12-22 北京华益精点生物技术有限公司 Medical report generation method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376008A (en) * 2013-08-14 2015-02-25 深圳市众鸿科技股份有限公司 Method for replacing XML (extensive markup language) absolute path
CN104809534A (en) * 2014-01-24 2015-07-29 北京理工大学 Business process management system
WO2017084174A1 (en) * 2015-11-19 2017-05-26 深圳市鹰硕技术有限公司 Image synchronous display method and device
CN111797336A (en) * 2020-07-07 2020-10-20 北京明略昭辉科技有限公司 Webpage parsing method and device, electronic equipment and medium

Also Published As

Publication number Publication date
CN112733056B (en) 2021-06-18

Similar Documents

Publication Publication Date Title
US8869023B2 (en) Conversion of a collection of data to a structured, printable and navigable format
CN100440222C (en) System and method for text legibility enhancement
CN112733056B (en) Document processing method, device, equipment and storage medium
MX2007011598A (en) Determining fields for presentable files and extensible markup language schemas for bibliographies and citations.
Pierazzo Digital Genetic Editions: The Encoding of Time in Manuscript Transcription
CN106294480A (en) File format conversion method, device and examination question import system
CN110990539B (en) Manuscript internal duplicate checking method and device and electronic equipment
Kettunen Keep, change or delete? Setting up a low resource OCR post-correction framework for a digitized old Finnish newspaper collection
CN113221506A (en) Lecture typesetting method and device, electronic equipment and storage medium
CN112433995A (en) File format conversion method, system, computer equipment and storage medium
CN111581937A (en) Document generation method and device, computer readable medium and electronic equipment
CN101464875B (en) Method for representing electronic dictionary catalog data by XML
CN116050370A (en) Template data processing method, system and related equipment
CN116306506A (en) Intelligent mail template method based on content identification
CN113868568A (en) Webpage keyword highlighting method, device, equipment and storage medium
US20050229099A1 (en) Presentation-independent semantic authoring of content
CN112613279A (en) File conversion method and device, computer device and readable storage medium
CN111783482A (en) Text translation method and device, computer equipment and storage medium
CN113539518A (en) Medicine data processing method and device based on RPA and AI and electronic equipment
CN112668282A (en) Method and system for converting format of equipment procedure document
CN111143719A (en) Online publication method, device and equipment of thesis and computer-readable storage medium
CN117975496B (en) Intelligent digitization method for paper publication and related equipment
CN112784780B (en) Review method, review device, computer equipment and storage medium
CN114760365B (en) Data extraction method and device and electronic equipment
CN116009863B (en) Front-end page rendering method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 100089 room 2356, third floor, building 2, incubator of Dongbeiwang Zhongguancun Software Park, Haidian District, Beijing

Patentee after: Baijiayun Group Co.,Ltd.

Address before: B104, 1st floor, building 12, Zhongguancun Software Park, Haidian District, Beijing 100082

Patentee before: Beijing Baijia Shilian Technology Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20220507

Address after: 518000 1309, Qianhai Xiangbin building, No. 18, Zimao West Street, Nanshan street, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong Province

Patentee after: Shenzhen Baishilian Technology Co.,Ltd.

Address before: 100089 room 2356, third floor, building 2, incubator of Dongbeiwang Zhongguancun Software Park, Haidian District, Beijing

Patentee before: Baijiayun Group Co.,Ltd.
