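WO2017036348A1 - Method and apparatus for compressing and decompressing an Extensible Markup Language (XML) document - Google Patents

Method and apparatus for compressing and decompressing an Extensible Markup Language (XML) document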

Info

Publication number
WO2017036348A1
WO2017036348A1 PCT/CN2016/096790 CN2016096790W WO2017036348A1 WO 2017036348 A1 WO2017036348 A1 WO 2017036348A1 CN 2016096790 W CN2016096790 W CN 2016096790W WO 2017036348 A1 WO2017036348 A1 WO 2017036348A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
mapping
markup language
extensible markup
xml document
Prior art date
Application number
PCT/CN2016/096790
Other languages
English (en)
French (fr)
Inventor
魏强
Original Assignee
阿里巴巴集团控股有限公司
魏强
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司, 魏强
Publication of WO2017036348A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof

Definitions

  • The present application relates to the technical field of computer processing, and in particular to a compression method for an Extensible Markup Language XML document, a decompression method for an Extensible Markup Language XML document, a compression device for an Extensible Markup Language XML document, and a decompression device for an Extensible Markup Language XML document.
  • XML (Extensible Markup Language) documents structure data so that it can be exchanged between departments, customers, and suppliers, supporting dynamic content generation, enterprise integration, and application development.
  • XML documents enable users to search more accurately, more easily transfer application components, and better describe things such as e-commerce transactions.
  • XML documents are usually compressed for easy transfer or storage.
  • Because an XML document is a text document, it is usually compressed with GZIP, the most common method for compressing text documents.
  • GZIP first compresses the document with a variant of the LZ77 algorithm and then compresses the result with Huffman coding.
  • GZIP is a general-purpose text compression method and is therefore widely applicable: any text document can be compressed with GZIP.
  • However, its compression operation is complex and time-consuming, and when it is applied to XML documents,
  • the compressed XML document is still large, so the compression efficiency is low.
  • During decompression, both the Huffman coding and the LZ77 step must be reversed, so the decompression operation is also complex and time-consuming.
  • the embodiment of the present application discloses a method for compressing an extensible markup language XML document, including:
  • the document parameters are replaced with the mapping codes to obtain a compressed Extensible Markup Language XML document.
  • it also includes:
  • it also includes:
  • the document parameters include elements and/or attributes.
  • the step of mapping the document parameter to a mapping code includes:
  • the document parameters after deduplication processing are mapped to a unique mapping code whose string length is less than or equal to the string length of the document parameter.
  • the step of mapping the document parameters after the deduplication processing into a unique mapping code comprises:
  • when they are the same, a target character string that contains the candidate character string is extracted from the document parameter as the new candidate character string, and the method returns to the step of determining whether the candidate character string is the same as an already mapped mapping code.
  • the step of mapping the document parameters after the de-duplication processing into a unique mapping code further includes:
  • the document parameters after deduplication are sorted according to the length of the string and/or the order of characters.
  • the embodiment of the present application further discloses a decompression method of an extensible markup language XML document, including:
  • the mapping code is replaced with the document parameter according to the mapping relationship to obtain the original Extensible Markup Language XML document.
  • it also includes:
  • when the mapping relationship is embedded in the compressed Extensible Markup Language XML document, the mapping relationship is deleted.
  • the step of obtaining the compressed extensible markup language XML document comprises:
  • the embodiment of the present application further discloses a compression device for an extensible markup language XML document, including:
  • a document parameter reading module for reading document parameters from an original extensible markup language XML document
  • a mapping module configured to map the document parameter to a mapping code
  • a document parameter replacement module configured to replace the document parameter with the mapping code to obtain a compressed Extensible Markup Language XML document.
  • it also includes:
  • a mapping relationship embedding module configured to embed the mapping relationship between the document parameter and the mapping code in the Extensible Markup Language XML document.
  • it also includes:
  • a transport module for transmitting a compressed extensible markup language XML document
  • a storage module for storing compressed Extensible Markup Language XML documents.
  • the document parameters include elements and/or attributes.
  • the mapping module includes:
  • a deduplication sub-module for performing deduplication processing on the document parameters
  • a deduplication mapping sub-module configured to map the document parameters after the deduplication processing into a unique mapping code, where the string length of the mapping code is less than or equal to the string length of the document parameter.
  • the deduplication mapping sub-module includes:
  • a candidate string extracting unit configured to extract a candidate character string from a document parameter after the deduplication processing
  • a mapping code determining unit configured to determine whether the candidate character string is the same as an already mapped mapping code; when they are not the same, the mapping code confirming unit is called; when they are the same, the target character string extracting unit is called and control returns to the mapping code determining unit;
  • a mapping code confirming unit configured to confirm that the candidate character string is the mapping code of the document parameter
  • a target character string extracting unit configured to extract, from the document parameter, a target character string including the candidate character string as a new candidate character string.
  • the deduplication mapping sub-module further includes:
  • a sorting unit for sorting document parameters after deduplication according to a string length and/or a character order.
  • the embodiment of the present application further discloses a decompression device for an extensible markup language XML document, including:
  • an XML document acquisition module for obtaining a compressed Extensible Markup Language XML document, where the compressed Extensible Markup Language XML document includes mapping codes;
  • a mapping relationship finding module configured to search for the mapping relationship between the mapping codes of the compressed Extensible Markup Language XML document and the document parameters
  • a mapping code replacement module configured to replace the mapping code with the document parameter according to the mapping relationship, to obtain the original Extensible Markup Language XML document.
  • it also includes:
  • the mapping relationship deleting module is configured to delete the mapping relationship when the mapping relationship is embedded in the compressed extensible markup language XML document.
  • the XML document obtaining module includes:
  • An XML document receiving sub-module for receiving a compressed Extensible Markup Language XML document that is transmitted.
  • The embodiment of the present application maps the document parameters of the original XML document to mapping codes and replaces the document parameters with those mapping codes to implement compression. Because document parameters are repeated in many cases, the replacement can greatly reduce the amount of data stored for the XML document, which improves compression efficiency. In addition, the mapping operation is simple, which reduces the time spent on compression.
  • bandwidth consumption can be reduced, or less storage space can be occupied.
  • the embodiment of the present application maps the document parameters into a unique mapping code, and the uniqueness of the document parameters ensures the uniqueness of the mapping code, thereby ensuring that the two can be mutually converted, and the accuracy of the recovery operation after compression is ensured.
  • the embodiment of the present application sequentially increases the length of the candidate character string for mapping detection, and reduces the length of the mapping code while ensuring the uniqueness of the mapping code.
  • During decompression, the mapping codes are replaced with the original document parameters according to the mapping relationship; the mapping operation is simple, which reduces the time spent on decompression.
  • FIG. 1 is a flow chart showing the steps of an embodiment of a method for compressing an extensible markup language XML document of the present application
  • FIG. 2 is a diagram showing an example of the structure of an XML document according to an embodiment of the present application
  • FIG. 3 is a flow chart showing the steps of an embodiment of a method for decompressing an extensible markup language XML document of the present application
  • FIG. 4 is a structural block diagram of an embodiment of a compression device for an extensible markup language XML document of the present application
  • FIG. 5 is a structural block diagram of an embodiment of a decompressing apparatus for an extensible markup language XML document of the present application.
  • Referring to FIG. 1, a flow chart of the steps of an embodiment of a method for compressing an Extensible Markup Language XML document of the present application is shown, which may specifically include the following steps:
  • Step 101: Read document parameters from an original Extensible Markup Language XML document.
  • a general compression method for an XML document is designed for the characteristics of an XML document, and the XML document can be compressed to a large extent.
  • An XML document is a text document.
  • The document parameters of an XML document include elements, which form a document tree that starts at the root and extends to the bottom of the tree.
  • The elements include a root element (such as bookstore) that is the parent of all other elements (such as book, title, author, year, price); the other elements are child elements of the root element.
  • Child elements on the same level are siblings (such as title, author, year, price). All elements can have text (such as Harry Potter, J K. Rowling, 2005, 29.99) and attributes (such as lang, category).
  • Step 102 Map the document parameter to a mapping code.
  • The names of elements and attributes, such as a book element representing a book or a title element representing a title, are generally not restricted; this is one of the characteristics of an XML document.
  • Therefore, document parameters such as elements and attributes can be represented by simpler mapping codes, which reduces the number of characters used by these document parameters and thus the amount of data to transmit or store. When the document is used, the parameters are restored for use by the computer.
  • The text of the XML document is the data the user actually wants to express and generally must not be altered.
  • The embodiment of the present application may therefore leave this data uncompressed.
  • the compression of document parameters such as elements and attributes is to represent the original element, attribute and other document parameters with as few characters as possible, that is, the length of the string of the mapping code is less than or equal to the length of the string of the document parameter.
  • The mapping codes are as short as possible and occupy few characters, so that the compressed document is as small as possible.
  • The mapping codes are unique and do not repeat one another, to avoid confusion.
  • The mapping can be looked up in both directions: the original document parameter (element, attribute, and so on) can be found from the mapping code, and the mapping code can be found from the original document parameter.
  • the embodiment of the present application may perform deduplication processing on the document parameters, and map the document parameters after the deduplication processing into a unique mapping code.
  • For example, a set may be maintained, and each extracted document parameter (element, attribute, and so on) is checked against it: if the parameter is already in the set, it is ignored; if not, it is added to the set. The document parameters in the set are therefore unique, and duplicate document parameters such as elements and attributes are removed.
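  • As a minimal sketch of the set-based deduplication described above, the check could look as follows in Python. The function name `dedupe_parameters` and the sample names are illustrative assumptions, not part of the application.

```python
def dedupe_parameters(parameters):
    """Return the parameters in first-seen order with duplicates dropped."""
    seen = set()      # the "set" used to detect parameters already extracted
    unique = []
    for name in parameters:
        if name in seen:
            continue  # already in the set: ignore
        seen.add(name)
        unique.append(name)
    return unique


# Hypothetical element and attribute names read from a bookstore document.
names = ["bookstore", "book", "category", "title", "lang", "book", "title", "lang"]
print(dedupe_parameters(names))
```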
  • the embodiment of the present application maps the document parameters into a unique mapping code, and the uniqueness of the document parameters ensures the uniqueness of the mapping code, thereby ensuring that the two can be mutually converted, and the accuracy of the recovery operation after compression is ensured.
  • the document parameters after the de-duplication processing may be sorted according to the length of the string and/or the order of the characters. The shorter the length of the string, the earlier the sorting, and the earlier the sorting, the more preferential the mapping.
  • book and year have a string length of 4, and title has a string length of 5, so book and year can be placed before title.
  • book and year have the same string length, and the first character b of book precedes the first character y of year in character order, so book can be ranked before year.
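  • The ordering rule above can be sketched in Python with a two-part sort key. This assumes that "character order" means ordinary code point order, which the text does not state explicitly; `sort_parameters` is an illustrative name.

```python
def sort_parameters(parameters):
    """Sort by string length first, then by character order as a tie-break."""
    return sorted(parameters, key=lambda name: (len(name), name))


print(sort_parameters(["title", "year", "book", "author", "price"]))
# ['book', 'year', 'price', 'title', 'author']
```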
  • The sorted document parameters, such as elements and attributes, can then be mapped one by one to obtain their mapping codes.
  • A candidate character string is extracted from a document parameter after the deduplication processing.
  • When the candidate character string is the same as an already mapped mapping code, a target character string containing the candidate character string is extracted from the document parameter as the new candidate character string, and the method returns to determining whether the candidate character string is the same as an already mapped mapping code.
  • This mapping method compares the candidate with the previously generated mapping codes and keeps taking a longer substring (the target string) until a unique substring is obtained, which yields a unique mapping code.
  • The mapping codes are also unique, so document parameters and mapping codes can be converted into each other, forming a bidirectional reference set.
  • the candidate string is initially the first character
  • the target string is a string consisting of the candidate string and an adjacent character
  • mapping code F of A is:
  • That is, the mapping code of A is the shortest of all substrings starting from the first character that does not already exist among the generated mapping codes.
  • For example, for book, start from the first character b (the initial candidate string) and check whether that mapping code already exists. If there is no mapping code b, b is mapped to book, that is, b is the mapping code of book.
  • If the mapping code b already exists, the next substring bo (the target string) is taken as the new candidate string and checked in the same way, until a mapping code is generated.
  • the first character can be used as a mapping code, and in the worst case, the entire string is used as a mapping code.
  • mapping codes are also unique.
  • mapping is performed in the foregoing manner, and the document parameters after the deduplication processing are sorted according to the length of the string and/or the order of the characters, so that the characters of the mapping code are minimized while ensuring the uniqueness of the mapping code.
  • For example, if the mapping code b is obtained when mapping book, a later document parameter that also starts with b obtains the mapping code bo, which keeps the two unique.
  • mapping code is as follows:
  • the embodiment of the present application sequentially increases the length of the candidate character string for mapping detection, and reduces the length of the mapping code while ensuring the uniqueness of the mapping code.
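  • The prefix-growing procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the application's own implementation: `build_mapping` is a hypothetical name, the candidate string is grown one character at a time from the first character, and parameters are deduplicated and sorted by length and then character order before mapping.

```python
def build_mapping(parameters):
    """Map each parameter to its shortest prefix not yet used as a code."""
    mapping = {}   # document parameter -> mapping code
    used = set()   # mapping codes generated so far
    for name in sorted(set(parameters), key=lambda n: (len(n), n)):
        for end in range(1, len(name) + 1):
            candidate = name[:end]       # candidate string, grown by one character
            if candidate not in used:    # not an existing code: accept it
                mapping[name] = candidate
                used.add(candidate)
                break
        else:
            # Defensive only: with unique, length-sorted names the full
            # string is always still available as a code.
            raise ValueError(f"no unique code available for {name!r}")
    return mapping


names = ["bookstore", "book", "title", "author", "year", "price", "category", "lang"]
codes = build_mapping(names)
print(codes["book"], codes["bookstore"])
```

  • In this sketch book is mapped first and receives b, so bookstore, which is longer and mapped later, falls back to bo. Because every code is the prefix of exactly one parameter, the table can be inverted for decompression.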
  • The above mapping method is only an example.
  • Other mapping methods may be set according to actual conditions, provided that the uniqueness of the mapping relationship and the legality of the mapping codes are ensured; the embodiments of the present application do not limit this.
  • For example, the mapping codes may be preset, such as a-z, A-Z, 0-9, and combinations thereof, and assigned directly to the document parameters after deduplication processing.
  • Each mapping code is assigned only once, to ensure that the mapping codes are unique.
  • To keep the string length as small as possible, mapping codes with a string length of 1 are assigned first, such as mapping code a for the first document parameter, mapping code b for the second document parameter, and so on.
  • Once the mapping codes with a string length of 1 have been used up, mapping codes with a string length of 2 are assigned, such as mapping code a0 for the Nth document parameter in the sorted order, mapping code a1 for the (N+1)th document parameter, and so on.
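  • A preset-code scheme of this kind could be sketched in Python as below. The names `preset_codes` and `assign_preset_codes` are hypothetical, and two details are assumptions of this sketch rather than statements from the application: the first character is restricted to a letter because an XML name cannot begin with a digit, and the enumeration order of the longer codes is simply the order in which the alphabet is listed.

```python
import itertools
import string

FIRST = string.ascii_letters                  # a-z then A-Z (legal first characters)
REST = string.ascii_letters + string.digits   # later characters may also be digits


def preset_codes():
    """Yield all codes of length 1, then all of length 2, and so on."""
    length = 1
    while True:
        for head in FIRST:
            for tail in itertools.product(REST, repeat=length - 1):
                yield head + "".join(tail)
        length += 1


def assign_preset_codes(sorted_parameters):
    """Give each parameter the next unused preset code, in order."""
    return dict(zip(sorted_parameters, preset_codes()))


print(assign_preset_codes(["book", "year", "title"]))
# {'book': 'a', 'year': 'b', 'title': 'c'}
```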
  • Step 103: Replace the document parameter with the mapping code to obtain a compressed Extensible Markup Language XML document.
  • The document parameters are replaced with their mapping codes one by one to implement compression of the XML document.
  • For example, after the original document parameters are replaced with the mapping codes, the obtained compressed XML document is as follows:
  • The XML document is compressed substantially, and the effect is more pronounced the larger the XML document is. The XML structure itself is not destroyed, and because the document parameters (elements, attributes, and so on) and their mapping codes are unique, the document can be restored when it is used.
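  • The replacement step can be sketched with Python's standard `xml.etree.ElementTree` module. This is an illustration under assumptions: the mapping table is supplied by the caller, the document uses no namespaces, and `replace_names` and the sample document are hypothetical. Text content is left untouched, as the description requires.

```python
import xml.etree.ElementTree as ET


def replace_names(xml_text, table):
    """Rename every element and attribute according to the mapping table."""
    root = ET.fromstring(xml_text)
    for element in root.iter():           # the root and all of its descendants
        element.tag = table[element.tag]
        renamed = {table[key]: value for key, value in element.attrib.items()}
        element.attrib.clear()
        element.attrib.update(renamed)
    return ET.tostring(root, encoding="unicode")


original = ('<bookstore><book category="CHILDREN">'
            '<title lang="en">Harry Potter</title></book></bookstore>')
table = {"bookstore": "bo", "book": "b", "category": "c", "title": "t", "lang": "l"}
print(replace_names(original, table))
# <bo><b c="CHILDREN"><t l="en">Harry Potter</t></b></bo>
```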
  • In practice, an XML document often contains a large number of duplicate document parameters such as elements and attributes.
  • For example, the interface of a cloud disk transmits data as XML documents, which usually need to describe stored files and therefore contain a large number of document parameters such as field (representing a file), name (representing a name), and the like.
  • As another example, an XML document for a bookstore usually needs to describe the stored books and therefore contains a large number of document parameters such as book (representing a book), title (representing a title), author (representing an author), and the like.
  • XML-based specifications, such as BPEL and BPMN, have a large number of duplicate elements.
  • XML-based definitions are usually specifications of elements, and in actual use there are many elements that conform to the specification.
  • For example, an e-commerce interface is used to get a list of items, and every item is described in the same way, such as with item (representing an item), sometimes hundreds of times.
  • the reduced data amount is (X-Z)*Y, so the amount of data stored for the XML document is greatly reduced, especially when the XML document is large and the document parameters are repeated many times.
  • An average 20M XML document can only be compressed to about 5M by GZIP.
  • it can be compressed to about 1M, and the volume of the document is greatly reduced.
  • The mapping relationship between the document parameters and the mapping codes needs to be kept as meta information for recovery.
  • The mapping relationship between the document parameters and the mapping codes can be embedded in the Extensible Markup Language XML document.
  • For example, the mapping relationship is embedded in the XML document as follows:
  • The mapping relationship is recorded in the form of "key<->value" (key is the document parameter, value is the mapping code) and is embedded in the header of the XML document in the form of a comment.
  • The mapping relationship can also be recorded in other forms, embedded in other locations of the XML document, or even recorded in a separate file; the embodiments of the present application do not limit this.
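  • Embedding the mapping relationship as a header comment could look like the Python sketch below. Only the "key<->value" form and the use of a comment come from the description; the comma separator, the single-comment layout, the absence of an XML declaration, and the name `embed_mapping` are assumptions made for illustration.

```python
def embed_mapping(compressed_xml, table):
    """Prefix the compressed document with a comment holding key<->value pairs."""
    # Assumes names and codes contain no ",", "<->" or "--", so the comment
    # stays well-formed and can be split back into pairs unambiguously.
    entries = ",".join(f"{key}<->{code}" for key, code in table.items())
    return f"<!--{entries}-->\n{compressed_xml}"


print(embed_mapping("<b><t>Harry Potter</t></b>", {"book": "b", "title": "t"}))
# <!--book<->b,title<->t-->
# <b><t>Harry Potter</t></b>
```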
  • the volume size of the original Extensible Markup Language XML document and the compressed Extensible Markup Language XML document can be compared.
  • If the compressed Extensible Markup Language XML document is smaller than the original Extensible Markup Language XML document, the compression can be considered valid, and the compressed Extensible Markup Language XML document can be transmitted and/or stored.
  • Otherwise, the compression may be considered invalid.
  • the compression method of the embodiment of the present application may be used for compression.
  • the compressed extensible markup language XML document may be further compressed by using a preset text compression method (such as GZIP) to further improve the compression ratio.
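  • The size comparison and the optional second compression pass can be sketched with Python's standard `gzip` module. The function names are hypothetical, and measuring size as the UTF-8 byte length is an assumption of this sketch.

```python
import gzip


def is_compression_valid(original_xml, compressed_xml):
    """Compression is valid only if the compressed document is smaller."""
    return len(compressed_xml.encode("utf-8")) < len(original_xml.encode("utf-8"))


def gzip_xml(xml_text):
    """Apply GZIP on top of the mapping-code compression."""
    return gzip.compress(xml_text.encode("utf-8"))


def gunzip_xml(payload):
    """Undo gzip_xml before the mapping codes are restored."""
    return gzip.decompress(payload).decode("utf-8")


original = "<book><title>Harry Potter</title></book>"
compressed = "<b><t>Harry Potter</t></b>"
if is_compression_valid(original, compressed):
    payload = gzip_xml(compressed)
    print(gunzip_xml(payload) == compressed)
```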
  • The embodiment of the present application maps the document parameters of the original XML document to mapping codes and replaces the document parameters with those mapping codes to implement compression. Because document parameters are repeated in many cases, the replacement can greatly reduce the amount of data stored for the XML document, which improves compression efficiency. In addition, the mapping operation is simple, which reduces the time spent on compression.
  • bandwidth consumption can be reduced, or less storage space can be occupied.
  • Referring to FIG. 3, a flow chart of the steps of an embodiment of a method for decompressing an Extensible Markup Language XML document of the present application is shown, which may specifically include the following steps:
  • Step 301 Obtain a compressed extensible markup language XML document.
  • the previously stored compressed extensible markup language XML document may be read from the database
  • If the compressed XML document was previously transmitted, the transmitted compressed Extensible Markup Language XML document may be received, and so on.
  • The compressed Extensible Markup Language XML document includes the mapping codes obtained by mapping the original document parameters.
  • If the compressed Extensible Markup Language XML document was further compressed with a preset text compression method (such as GZIP), then after it is obtained it can be decompressed according to that text compression method (such as GZIP).
  • Step 302 Search for a mapping relationship between a mapping code and a document parameter of the compressed extensible markup language XML document.
  • If the mapping relationship is embedded in the XML document (such as in the header), it can be read from the XML document (such as from the header).
  • For example, the mapping relationship embedded in the XML document is as follows:
  • The mapping relationship is embedded in the header of the XML document in the form of a comment, including:
  • If the mapping relationship is stored in another manner, such as recorded in a separate file, it may be read in the corresponding manner, such as by reading that separate file.
  • The embodiments of the present application do not limit this.
  • Step 303: Replace the mapping code with the document parameter according to the mapping relationship to obtain the original Extensible Markup Language XML document.
  • The document parameter corresponding to each mapping code in the compressed XML document may be looked up according to the recorded form of the mapping relationship, and the mapping code is replaced with that document parameter for recovery.
  • For example, the mapping relationship is recorded in the form of "key<->value", where key is a document parameter and value is a mapping code.
  • If the mapping code b needs to be restored, the mapping relationship whose value is b can be found, that is, book<->b, and the mapping code b is replaced with its key, book.
  • If the mapping relationship is embedded in the compressed Extensible Markup Language XML document, the mapping relationship is deleted, and the original XML document is finally obtained for normal use.
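  • Steps 302 and 303 can be sketched together in Python. As before, this is an illustration under assumptions: the mapping relationship is a single leading comment of comma-separated "key<->value" pairs, the document has no XML declaration or namespaces, and `decompress` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET


def decompress(document):
    """Read the embedded mapping, restore the names, and drop the mapping."""
    match = re.match(r"<!--(.*?)-->\s*", document, flags=re.DOTALL)
    if match is None:
        raise ValueError("no embedded mapping relationship found")
    code_to_name = {}
    for entry in match.group(1).split(","):
        name, code = entry.split("<->")   # key is the parameter, value the code
        code_to_name[code] = name
    # Parsing only the text after the comment deletes the mapping relationship.
    root = ET.fromstring(document[match.end():])
    for element in root.iter():
        element.tag = code_to_name[element.tag]
        restored = {code_to_name[key]: value for key, value in element.attrib.items()}
        element.attrib.clear()
        element.attrib.update(restored)
    return ET.tostring(root, encoding="unicode")


packed = ('<!--bookstore<->bo,book<->b,title<->t,lang<->l-->\n'
          '<bo><b><t l="en">Harry Potter</t></b></bo>')
print(decompress(packed))
# <bookstore><book><title lang="en">Harry Potter</title></book></bookstore>
```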
  • During decompression, the mapping codes are replaced with the original document parameters according to the mapping relationship; the mapping operation is simple, which reduces the time spent on decompression.
  • Referring to FIG. 4, a structural block diagram of an embodiment of a compression device for an Extensible Markup Language XML document of the present application is shown, which may specifically include the following modules:
  • a document parameter reading module 401 configured to read document parameters from an original Extensible Markup Language XML document;
  • a mapping module 402 configured to map the document parameter to a mapping code;
  • a document parameter replacement module 403 configured to replace the document parameter with the mapping code to obtain a compressed Extensible Markup Language XML document.
  • the apparatus may further include the following modules:
  • a mapping relationship embedding module configured to embed the mapping relationship between the document parameter and the mapping code in the Extensible Markup Language XML document.
  • the apparatus may further include the following modules:
  • a transport module for transmitting a compressed extensible markup language XML document
  • a storage module for storing compressed Extensible Markup Language XML documents.
  • the document parameters can include elements and/or attributes.
  • the mapping module 402 may include the following sub-modules:
  • a deduplication sub-module for performing deduplication processing on the document parameters
  • a deduplication mapping sub-module configured to map the document parameters after the deduplication processing into a unique mapping code, where the string length of the mapping code is less than or equal to the string length of the document parameter.
  • the deduplication mapping sub-module may include the following units:
  • a candidate string extracting unit configured to extract a candidate character string from a document parameter after the deduplication processing
  • a mapping code determining unit configured to determine whether the candidate character string is the same as an already mapped mapping code; when they are not the same, the mapping code confirming unit is called; when they are the same, the target character string extracting unit is called and control returns to the mapping code determining unit;
  • a mapping code confirming unit configured to confirm that the candidate character string is the mapping code of the document parameter
  • a target character string extracting unit configured to extract, from the document parameter, a target character string including the candidate character string as a new candidate character string.
  • the deduplication mapping sub-module may further include the following units:
  • a sorting unit for sorting document parameters after deduplication according to a string length and/or a character order.
  • Referring to FIG. 5, a structural block diagram of an embodiment of a decompression device for an Extensible Markup Language XML document of the present application is shown, which may specifically include the following modules:
  • an XML document obtaining module 501 configured to obtain a compressed Extensible Markup Language XML document, where the compressed Extensible Markup Language XML document includes mapping codes;
  • a mapping relationship searching module 502 configured to search for the mapping relationship between the mapping codes of the compressed Extensible Markup Language XML document and the document parameters;
  • a mapping code replacement module 503 configured to replace the mapping code with the document parameter according to the mapping relationship to obtain the original Extensible Markup Language XML document.
  • the apparatus may further include the following modules:
  • the mapping relationship deleting module is configured to delete the mapping relationship when the mapping relationship is embedded in the compressed extensible markup language XML document.
  • the XML document obtaining module 501 may include the following sub-modules:
  • An XML document receiving sub-module for receiving a compressed Extensible Markup Language XML document that is transmitted.
  • For the device embodiments, the description is relatively brief; for the relevant details, refer to the description of the method embodiments.
  • The embodiments of the present application can be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the present application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • The memory may include volatile memory in a computer readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media include both persistent and non-persistent, removable and non-removable media.
  • Information storage can be implemented by any method or technology. The information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic tape cartridges, magnetic tape storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.
  • As used here, computer readable media do not include transitory computer readable media, such as modulated data signals and carrier waves.
  • The embodiments of the present application are described with reference to flowcharts and/or block diagrams of the method, the terminal device (system), and the computer program product according to the embodiments of the present application.
  • It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions.
  • These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • The computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising an instruction device.
  • The instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Document Processing Apparatus (AREA)

Abstract

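A compression method, a decompression method, and corresponding apparatuses for an Extensible Markup Language (XML) document. The compression method comprises: reading document parameters from an original XML document (101); mapping the document parameters to mapping codes (102); and replacing the document parameters with the mapping codes to obtain a compressed XML document (103). Because document parameters are repeated in many cases, the replacement can greatly reduce the amount of data stored in the XML document and improve compression efficiency; moreover, the mapping operation is simple, which reduces the time consumed by compression.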

Description

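Compression and decompression methods and apparatuses for an Extensible Markup Language (XML) document
Technical Field
The present application relates to the technical field of computer processing, and in particular to a compression method for an XML document, a decompression method for an XML document, a compression apparatus for an XML document, and a decompression apparatus for an XML document.
Background
An XML (Extensible Markup Language) document structures data so that the data can be exchanged among departments, customers, and suppliers, enabling dynamic content generation, enterprise integration, and application development.
XML documents enable users to search more accurately, deliver application components more conveniently, and describe things, such as e-commerce transactions, more precisely.
To facilitate transmission or storage, XML documents are usually compressed.
Because an XML document is a text document, it is usually compressed with GZIP, the common method for text documents. GZIP first compresses the document with a variant of the LZ77 algorithm and then compresses the result with Huffman coding.
GZIP is thus a general-purpose text compression method with wide applicability: any text document can be compressed with it. However, its compression operation is complex and time-consuming, and when it is applied to an XML document, the compressed XML document is still large, so compression efficiency is low.
Because the compressed XML document is still large, transmitting or storing it consumes considerable bandwidth or occupies considerable storage space.
During decompression, the Huffman coding and the LZ77 algorithm must be reversed; the decompression operation is complex and also time-consuming.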
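For reference, the GZIP baseline described in this background can be sketched with Python's standard `gzip` module. This is only an illustration: the sample text below is an assumption made up for the sketch, not data from the application.

```python
import gzip

# Illustrative XML text (an assumption for this sketch, not a real workload).
xml_text = "<bookstore>" + "<book><title>Learning XML</title></book>" * 50 + "</bookstore>"
raw = xml_text.encode("utf-8")

# gzip uses DEFLATE, that is, an LZ77 variant followed by Huffman coding.
packed = gzip.compress(raw)

# Decompression reverses both stages and returns the original bytes.
restored = gzip.decompress(packed)

print(len(raw), len(packed))
```

The round trip is lossless, and the repeated tags are what the general-purpose compressor exploits here.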
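Summary
In view of the above problems, the embodiments of the present application are proposed to provide a compression method for an XML document, a decompression method for an XML document, and a corresponding compression apparatus and decompression apparatus for an XML document that overcome the above problems or at least partially solve them.
To solve the above problems, an embodiment of the present application discloses a compression method for an XML document, comprising:
reading document parameters from an original XML document;
mapping the document parameters to mapping codes;
replacing the document parameters with the mapping codes to obtain a compressed XML document.
Optionally, the method further comprises:
embedding the mapping relationship between the document parameters and the mapping codes in the XML document.
Optionally, the method further comprises:
transmitting and/or storing the compressed XML document.
Optionally, the document parameters comprise elements and/or attributes.
Optionally, the step of mapping the document parameters to mapping codes comprises:
deduplicating the document parameters;
mapping each deduplicated document parameter to a unique mapping code, wherein the string length of the mapping code is less than or equal to the string length of the document parameter.
Optionally, the step of mapping each deduplicated document parameter to a unique mapping code comprises:
extracting a candidate string from the deduplicated document parameter;
determining whether the candidate string is identical to an already assigned mapping code;
if not, confirming the candidate string as the mapping code of the document parameter;
if so, extracting from the document parameter a target string that contains the candidate string, taking it as the new candidate string, and returning to the step of determining whether the candidate string is identical to an already assigned mapping code.
Optionally, the step of mapping each deduplicated document parameter to a unique mapping code further comprises:
sorting the deduplicated document parameters by string length and/or character order.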
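An embodiment of the present application further discloses a decompression method for an XML document, comprising:
obtaining a compressed XML document, the compressed XML document including mapping codes;
looking up the mapping relationship between the mapping codes and the document parameters of the compressed XML document;
replacing the mapping codes with the document parameters according to the mapping relationship to obtain the original XML document.
Optionally, the method further comprises:
deleting the mapping relationship when the mapping relationship is embedded in the compressed XML document.
Optionally, the step of obtaining the compressed XML document comprises:
reading a previously stored compressed XML document;
or,
receiving an incoming compressed XML document.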
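An embodiment of the present application further discloses a compression apparatus for an XML document, comprising:
a document parameter reading module, configured to read document parameters from an original XML document;
a mapping module, configured to map the document parameters to mapping codes;
a document parameter replacement module, configured to replace the document parameters with the mapping codes to obtain a compressed XML document.
Optionally, the apparatus further comprises:
a mapping relationship embedding module, configured to embed the mapping relationship between the document parameters and the mapping codes in the XML document.
Optionally, the apparatus further comprises:
a transmission module, configured to transmit the compressed XML document;
and/or,
a storage module, configured to store the compressed XML document.
Optionally, the document parameters comprise elements and/or attributes.
Optionally, the mapping module comprises:
a deduplication submodule, configured to deduplicate the document parameters;
a deduplicated-mapping submodule, configured to map each deduplicated document parameter to a unique mapping code, wherein the string length of the mapping code is less than or equal to the string length of the document parameter.
Optionally, the deduplicated-mapping submodule comprises:
a candidate string extraction unit, configured to extract a candidate string from the deduplicated document parameter;
a mapping code determination unit, configured to determine whether the candidate string is identical to an already assigned mapping code; if not, to invoke the mapping code confirmation unit; if so, to invoke the target string extraction unit and then return to the mapping code determination unit;
a mapping code confirmation unit, configured to confirm the candidate string as the mapping code of the document parameter;
a target string extraction unit, configured to extract from the document parameter a target string that contains the candidate string and take it as the new candidate string.
Optionally, the deduplicated-mapping submodule further comprises:
a sorting unit, configured to sort the deduplicated document parameters by string length and/or character order.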
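An embodiment of the present application further discloses a decompression apparatus for an XML document, comprising:
an XML document obtaining module, configured to obtain a compressed XML document, the compressed XML document including mapping codes;
a mapping relationship lookup module, configured to look up the mapping relationship between the mapping codes and the document parameters of the compressed XML document;
a mapping code replacement module, configured to replace the mapping codes with the document parameters according to the mapping relationship to obtain the original XML document.
Optionally, the apparatus further comprises:
a mapping relationship deletion module, configured to delete the mapping relationship when the mapping relationship is embedded in the compressed XML document.
Optionally, the XML document obtaining module comprises:
an XML document reading submodule, configured to read a previously stored compressed XML document;
or,
an XML document receiving submodule, configured to receive an incoming compressed XML document.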
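The embodiments of the present application have the following advantages:
The embodiments of the present application map the document parameters of the original XML document and replace the document parameters with the mapping codes to achieve compression. Because document parameters are repeated in many cases, the replacement can greatly reduce the amount of data stored in the XML document and improve compression efficiency; moreover, the mapping operation is simple, which reduces the time consumed by compression.
Consequently, when the compressed XML document is transmitted or stored, bandwidth consumption or storage space usage is reduced.
The embodiments of the present application map the deduplicated document parameters to unique mapping codes. The uniqueness of the document parameters ensures the uniqueness of the mapping codes, so the two can be converted into each other, which ensures that restoration after compression is accurate.
The embodiments of the present application test candidate strings of successively increasing length, which keeps mapping codes short while ensuring their uniqueness.
The embodiments of the present application replace the mapping codes with the original document parameters by means of the mapping relationship; the mapping operation is simple, which reduces the time consumed by decompression.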
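Brief Description of the Drawings
FIG. 1 is a flowchart of the steps of an embodiment of a compression method for an XML document according to the present application;
FIG. 2 is a diagram showing an example structure of an XML document according to an embodiment of the present application;
FIG. 3 is a flowchart of the steps of an embodiment of a decompression method for an XML document according to the present application;
FIG. 4 is a structural block diagram of an embodiment of a compression apparatus for an XML document according to the present application;
FIG. 5 is a structural block diagram of an embodiment of a decompression apparatus for an XML document according to the present application.
Detailed Description
To make the above objects, features, and advantages of the present application clearer and easier to understand, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Referring to FIG. 1, a flowchart of the steps of an embodiment of a compression method for an XML document according to the present application is shown, which may specifically include the following steps:
Step 101: read document parameters from the original XML document.
In the embodiments of the present application, a general compression method for XML documents is designed around the characteristics of XML documents, and it can compress XML documents substantially.
It should be noted that an XML document is a text document.
The following is an example XML document describing books: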
<bookstore>
<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>
<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K.Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
<book category="WEB">
  <title lang="en">Learning XML</title>
  <author>Erik T.Ray</author>
  <year>2003</year>
  <price>39.95</price>
</book>
</bookstore>
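As shown in FIG. 2, the document parameters of an XML document include elements, which form a document tree that starts at the root and extends down to the lowest level of the tree.
Terms such as parent, child, and sibling describe the relationships between elements.
Specifically, the root element (such as bookstore) is the parent of all other elements (such as book, title, author, year, price); conversely, the other elements are children of the root element.
Child elements at the same level are called siblings (brothers or sisters, such as title, author, year, price). Every element can have text (such as Harry Potter, J K.Rowling, 2005, 29.99) and attributes (such as lang, category).
In practice, on a computer programmed in the Java language, the built-in DOM and SAX technologies can be used to parse the XML document and read document parameters such as elements and attributes.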
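Reading the document parameters can be sketched as follows. The text above mentions Java's DOM and SAX; this sketch instead uses Python's standard `xml.etree.ElementTree` as a stand-in, and the function name `read_parameters` is hypothetical.

```python
import xml.etree.ElementTree as ET

def read_parameters(xml_text):
    """Return every element name and attribute name in document order, repeats included."""
    names = []
    for elem in ET.fromstring(xml_text).iter():
        names.append(elem.tag)            # element name
        names.extend(elem.attrib.keys())  # attribute names of this element
    return names

sample = '<bookstore><book category="WEB"><title lang="en">Learning XML</title></book></bookstore>'
print(read_parameters(sample))
# ['bookstore', 'book', 'category', 'title', 'lang']
```

Element text such as Learning XML is deliberately not collected, since only elements and attributes are treated as document parameters.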
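Step 102: map the document parameters to mapping codes.
In an XML document, document parameters such as elements and attributes describe the data, and they can be freely defined; for ease of understanding, the corresponding English words are usually used.
For example, a book element represents a book, a title element represents a title, and so on. There is generally no particular restriction, which is one of the characteristics of XML documents.
A computer, however, does not care about the specific meaning of document parameters such as elements and attributes. During transmission, storage, and similar processes, they can be represented by simpler mapping codes, which reduces the number of characters in the document parameters and thus the amount transmitted or stored; when the document is used, the original parameters are restored for the computer.
It should be noted that the text of an XML document is data that the user specifically expresses and generally must not be altered; the embodiments of the present application may leave it uncompressed.
Compressing document parameters such as elements and attributes, that is, compressing words, means representing the original parameters with as few characters as possible, so the string length of a mapping code is less than or equal to the string length of the document parameter.
In addition, the original document parameters must be recoverable. The following requirements are usually met:
1. The word representing a document parameter such as an element or attribute (hereinafter the mapping code) is as short as possible and occupies few characters, so that the compressed document is smaller;
2. Mapping codes are unique and never repeated, to avoid confusion;
3. Lookup works in both directions: the original document parameter can be found from the mapping code, and the mapping code can be found from the original document parameter.
Therefore, the embodiments of the present application may deduplicate the document parameters and map each deduplicated document parameter to a unique mapping code.
In a specific implementation, a set can be maintained. For each extracted document parameter such as an element or attribute, it is determined whether the parameter is already in the set; if so, it is ignored, and if not, it is added. The document parameters in the set are therefore unique, and duplicates are removed.
For example, if an attribute is named title and an element is also named title, only one title is in the set.
The embodiments of the present application map the deduplicated document parameters to unique mapping codes. The uniqueness of the document parameters ensures the uniqueness of the mapping codes, so the two can be converted into each other, which ensures that restoration after compression is accurate.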
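The set-based deduplication described above can be sketched as follows. The function name `deduplicate` is hypothetical, and keeping first-seen order is an assumption of this sketch (the text only requires that the set hold each name once).

```python
def deduplicate(names):
    """Keep the first occurrence of each name; later repeats are ignored."""
    seen = set()
    unique = []
    for name in names:
        if name not in seen:   # already in the set: ignore; otherwise add it
            seen.add(name)
            unique.append(name)
    return unique

print(deduplicate(["book", "title", "title", "book", "lang"]))
# ['book', 'title', 'lang']
```

An attribute and an element that share the name title collapse to a single entry, as in the example in the text.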
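Further, the deduplicated document parameters can be sorted by string length and/or character order. The shorter the string, the earlier it is sorted, and the earlier it is sorted, the sooner it is mapped.
For example, book and year have a string length of 4 and title has a string length of 5, so book and year can be placed before title.
Document parameters of the same string length can be sorted by character order.
For example, book and year have the same string length, and the first character of book, b, precedes the first character of year, y, so book can be placed before year.
The sorted document parameters, such as elements and attributes, can then be mapped one by one to obtain their mapping codes.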
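The ordering rule (shorter strings first, ties broken by character order) can be sketched with a compound sort key. The function name `sort_parameters` is hypothetical.

```python
def sort_parameters(names):
    """Sort by string length first, then by character order within the same length."""
    return sorted(names, key=lambda s: (len(s), s))

extracted = ["bookstore", "book", "category", "title", "lang", "author", "year", "price"]
print(sort_parameters(extracted))
# ['book', 'lang', 'year', 'price', 'title', 'author', 'category', 'bookstore']
```

For the bookstore example, this reproduces the sorted list given later in the description.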
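In one mapping approach, a candidate string is extracted from the deduplicated document parameter;
it is determined whether the candidate string is identical to an already assigned mapping code;
if not, the candidate string is confirmed as the mapping code of the document parameter;
if so, a target string that contains the candidate string is extracted from the document parameter and taken as the new candidate string, and the determination is repeated.
In this approach, each time a substring of the document parameter (the candidate string) is taken, it is compared with the previously generated mapping codes. If it matches one, a longer substring (the target string) is taken, until a unique substring, and thus a unique mapping code, is obtained.
Moreover, because the deduplicated document parameters are unique and the mapping codes are also unique, they can be converted into each other and form a bidirectional lookup.
To keep the number of characters in a mapping code as small as possible, the candidate string is usually initialized to the first character, and the target string is the candidate string extended by the next adjacent character.
Let M be the set of mapping codes already generated and A a document parameter. The mapping code F of A is then:
F(A) = Min(A.substr(0, n) not in M), for n in [1, A.length]
That is, the mapping code of A is the shortest of the substrings starting at the first character that does not already exist among the generated mapping codes.
For example, for the element book, start with the first character b (the initial candidate string) and check whether this mapping code already exists. If mapping code b does not exist, b is mapped to book, that is, b becomes the mapping code of book.
If mapping code b exists, the next substring bo (the target string) is taken as the new candidate string and checked in the same way, until a mapping code is generated.
In the ideal case the first character serves as the mapping code; in the worst case the whole string is used as the mapping code.
Because deduplication has already made the document parameters themselves unique, their mapping codes are also unique.
When mapping in this way, sorting the deduplicated document parameters by string length and/or character order keeps the mapping codes as short as possible while ensuring their uniqueness.
For example, suppose both b and book are to be mapped. If book is mapped first, it receives mapping code b, and b can then no longer be given a unique mapping code. Conversely, if b is mapped first, it receives mapping code b, and book then receives mapping code bo, so both are unique.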
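The shortest-unused-prefix rule can be sketched as follows. The function name `assign_codes` and the choice to raise `ValueError` when no prefix is left are assumptions of this sketch; the text does not say how that failure should be handled.

```python
def assign_codes(sorted_names):
    """Give each name the shortest prefix that is not already used as a mapping code."""
    codes = {}
    used = set()
    for name in sorted_names:
        for n in range(1, len(name) + 1):
            candidate = name[:n]            # candidate string: first n characters
            if candidate not in used:
                codes[name] = candidate
                used.add(candidate)
                break
        else:
            # Every prefix, including the full name, is already taken.
            raise ValueError(f"no unique prefix left for {name!r}")
    return codes

print(assign_codes(["b", "book"]))   # {'b': 'b', 'book': 'bo'}
```

Mapping book before b raises the error, which is the ordering problem described above; with names sorted by length first, the full name is always still available as a last resort.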
For the example XML document above, the extracted elements and attributes are as follows:
bookstore
book
category
title
lang
author
year
price
After sorting, the elements and attributes are as follows:
book
lang
year
price
title
author
category
bookstore
Taking the shortest substring as the rule, the mapping codes are as follows:
book<->b
lang<->l
year<->y
price<->p
title<->t
author<->a
category<->c
bookstore<->bo
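The embodiments of the present application test candidate strings of successively increasing length, which keeps mapping codes short while ensuring their uniqueness.
Of course, the above mapping method is only an example. When implementing the embodiments of the present application, other mapping methods may be used as the situation requires, provided that the mapping relationship is unique and the mapping codes are legal; the embodiments of the present application do not limit this.
For example, mapping codes may be preset, such as a-z, A-Z, 0-9, and their combinations, and assigned directly to the deduplicated document parameters. To ensure uniqueness, each mapping code is assigned only once. To keep the mapping codes as short as possible, codes of string length 1 are assigned first: mapping code a to the first document parameter in sorted order, mapping code b to the second, and so on. After the codes of string length 1 are used up, codes of string length 2 are assigned, for example mapping code a0 to the N-th document parameter and mapping code a1 to the (N+1)-th, and so forth.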
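The preset-code alternative can be sketched with a generator that yields one-character codes before two-character codes. Two points are assumptions of this sketch rather than statements from the text: the first character is restricted to letters because an XML name cannot begin with a digit, and the enumeration order within each length (aa, ab, ... before a0, a1) is arbitrary. The names `preset_codes` and `assign_preset` are hypothetical.

```python
import itertools
import string

FIRST = string.ascii_letters                 # assumption: a code must start with a letter
REST = string.ascii_letters + string.digits  # later characters may also be digits

def preset_codes():
    """Yield a, b, ..., Z, then two-character codes, then three-character codes, and so on."""
    length = 1
    while True:
        for first in FIRST:
            for tail in itertools.product(REST, repeat=length - 1):
                yield first + "".join(tail)
        length += 1

def assign_preset(sorted_names):
    """Pair each name with the next unused preset code; each code is used once."""
    return dict(zip(sorted_names, preset_codes()))

print(assign_preset(["book", "lang", "year"]))
# {'book': 'a', 'lang': 'b', 'year': 'c'}
```

Unlike the prefix rule, these codes carry no hint of the original name, so restoration depends entirely on the stored mapping relationship.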
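Besides the above determination methods, those skilled in the art may adopt other determination methods as actually needed; the embodiments of the present application do not limit this either.
Step 103: replace the document parameters with the mapping codes to obtain the compressed XML document.
In the embodiments of the present application, the document parameters are replaced with their mapping codes one by one, which compresses the XML document.
For the example XML document above, replacing the original document parameters with the mapping codes gives the following compressed XML document: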
<bo>
<b c="COOKING">
  <t l="en">Everyday Italian</t>
  <a>Giada De Laurentiis</a>
  <y>2005</y>
  <p>30.00</p>
</b>
<b c="CHILDREN">
  <t l="en">Harry Potter</t>
  <a>J K.Rowling</a>
  <y>2005</y>
  <p>29.99</p>
</b>
<b c="WEB">
  <t l="en">Learning XML</t>
  <a>Erik T.Ray</a>
  <y>2003</y>
  <p>39.95</p>
</b>
</bo>
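As can be seen, the XML document is compressed considerably, and the effect is more pronounced when the XML document is large. The XML structure itself is not damaged, and the document parameters such as elements and attributes and the mapping codes are all unique, so the document can be restored when it is used.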
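The replacement in step 103 can be sketched with `xml.etree.ElementTree` by renaming tags and attribute names in place. This is a minimal sketch under assumptions: the function name `compress` is hypothetical, the mapping is passed in as a plain dict, and namespaces, comments, and the XML declaration are not handled.

```python
import xml.etree.ElementTree as ET

def compress(xml_text, codes):
    """Replace every element name and attribute name with its mapping code."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = codes[elem.tag]
        # Rebuild the attributes with mapped names; attribute values are untouched.
        renamed = {codes[name]: value for name, value in elem.attrib.items()}
        elem.attrib.clear()
        elem.attrib.update(renamed)
    return ET.tostring(root, encoding="unicode")

codes = {"bookstore": "bo", "book": "b", "category": "c", "title": "t", "lang": "l"}
original = '<bookstore><book category="WEB"><title lang="en">Learning XML</title></book></bookstore>'
print(compress(original, codes))
# <bo><b c="WEB"><t l="en">Learning XML</t></b></bo>
```

Element text and attribute values pass through unchanged, consistent with leaving the user's data uncompressed.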
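In practice, the same kind of object often has to be defined repeatedly, so an XML document contains many repeated document parameters such as elements and attributes.
For example, a cloud-drive interface transmits data as XML documents. Such a document usually defines the stored files and contains many document parameters such as field (representing a file) and name (representing a name).
As another example, in a library database, the XML document usually defines the stored books and contains many document parameters such as book (representing a book), title (representing a title), and author (representing an author).
In addition, applications built on XML documents also contain many repeated elements.
For example, XML-based specifications such as BPEL and BPMN contain many repeated elements.
An XML-based definition usually specifies the elements, and in actual use there are many elements that conform to the specification.
For example, an e-commerce interface that returns a product list describes every product in the same way, for instance with item (representing an item), and there may be hundreds of them.
When compressing with the embodiments of the present application, suppose the original document parameter has string length X and occurs Y times, and its mapping code has string length Z. The amount of data saved is (X-Z)*Y, so the data stored in the XML document is greatly reduced, especially when the XML document is large and the document parameters are repeated many times.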
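The saving (X-Z)*Y can be summed over all document parameters. In the sketch below the function name `saved_characters` and the occurrence counts are assumptions for illustration; an element name is counted once for its start tag and once for its end tag.

```python
def saved_characters(codes, counts):
    """Sum (X - Z) * Y over all names: X = name length, Z = code length, Y = occurrences."""
    return sum((len(name) - len(code)) * counts[name] for name, code in codes.items())

# title appears in 3 start tags and 3 end tags in the bookstore example, so Y = 6.
codes = {"title": "t", "book": "b"}
counts = {"title": 6, "book": 6}
print(saved_characters(codes, counts))
# (5 - 1) * 6 + (4 - 1) * 6 = 42
```

If the mapping relationship is stored inside the document, its own size has to be subtracted from this figure.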
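Experimental data show that an XML document averaging 20M is compressed to 5M by GZIP alone, but to about 1M by the compression method of the embodiments of the present application, a considerable reduction in document size.
In addition, the mapping relationship between the document parameters and the mapping codes must be kept as meta-information for restoration.
In one case, the mapping relationship between the document parameters and the mapping codes can be embedded in the XML document.
For the example XML document above, the mapping relationship is embedded in the XML document as follows: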
<!--
book<->b
lang<->l
year<->y
price<->p
title<->t
author<->a
category<->c
bookstore<->bo
-->
<bo>
<b c="COOKING">
  <t l="en">Everyday Italian</t>
  <a>Giada De Laurentiis</a>
  <y>2005</y>
  <p>30.00</p>
</b>
<b c="CHILDREN">
  <t l="en">Harry Potter</t>
  <a>J K.Rowling</a>
  <y>2005</y>
  <p>29.99</p>
</b>
<b c="WEB">
  <t l="en">Learning XML</t>
  <a>Erik T.Ray</a>
  <y>2003</y>
  <p>39.95</p>
</b>
</bo>
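In this example, the mapping relationship is recorded in the form "key<->value" (where key is the document parameter and value is the mapping code) and embedded as a comment at the head of the XML document.
Of course, the mapping relationship may be recorded in other forms, embedded at other positions in the XML document, or even recorded in a separate file; the embodiments of the present application do not limit this.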
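Embedding the mapping relationship as a leading comment can be sketched as plain string assembly. The function name `embed_mapping` is hypothetical, and the sketch assumes that no name or code contains two consecutive hyphens, which would not be allowed inside an XML comment.

```python
def embed_mapping(compressed_xml, codes):
    """Prepend a comment that lists one 'name<->code' pair per line."""
    lines = [f"{name}<->{code}" for name, code in codes.items()]
    header = "<!--\n" + "\n".join(lines) + "\n-->\n"
    return header + compressed_xml

print(embed_mapping("<bo><b/></bo>", {"book": "b", "bookstore": "bo"}))
```

The printed document starts with the comment block, followed by the compressed XML, in the same layout as the example above.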
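In practice, many open-source databases store data as XML documents, the parameters returned by the interfaces of some e-commerce platforms are also based on XML documents, and so on.
After the XML document is compressed, the sizes of the original XML document and the compressed XML document can be compared to make sure the compression is effective.
If the compressed XML document is smaller than the original XML document, the compression can be considered effective, and the compressed XML document can be transmitted and/or stored.
If the compressed XML document is larger than the original XML document, the compression can be considered ineffective, and the compression method of the embodiments of the present application need not be applied to this XML document.
It should be noted that, before transmission and/or storage, the compressed XML document may be compressed further with a preset text compression method (such as GZIP) to further improve the compression ratio.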
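The size check and the optional GZIP pass can be sketched together. This sketch treats the mapping-based compression as effective only when its output is smaller than the original; the function name `choose_output` and the returned flag are hypothetical.

```python
import gzip

def choose_output(original_xml, compressed_xml):
    """Pick the smaller document by UTF-8 size, then apply GZIP; return (payload, used_mapping)."""
    original = original_xml.encode("utf-8")
    compressed = compressed_xml.encode("utf-8")
    used_mapping = len(compressed) < len(original)   # effective only if it got smaller
    chosen = compressed if used_mapping else original
    return gzip.compress(chosen), used_mapping

payload, used_mapping = choose_output("<bookstore></bookstore>", "<bo></bo>")
print(used_mapping, gzip.decompress(payload))
# True b'<bo></bo>'
```

A receiver would need the flag (or some equivalent marker) to know whether the mapping step has to be reversed after the GZIP layer is removed.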
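The embodiments of the present application map the document parameters of the original XML document and replace the document parameters with the mapping codes to achieve compression. Because document parameters are repeated in many cases, the replacement can greatly reduce the amount of data stored in the XML document and improve compression efficiency; moreover, the mapping operation is simple, which reduces the time consumed by compression.
Consequently, when the compressed XML document is transmitted or stored, bandwidth consumption or storage space usage is reduced.
Referring to FIG. 3, a flowchart of the steps of an embodiment of a decompression method for an XML document according to the present application is shown, which may specifically include the following steps:
Step 301: obtain the compressed XML document.
In a specific implementation, if the compressed XML document was previously stored in a database, the previously stored compressed XML document can be read from the database;
if the compressed XML document was previously transmitted, the incoming compressed XML document can be received, and so on.
In the embodiments of the present application, the compressed XML document includes mapping codes, which were obtained by mapping the original document parameters.
For example, a compressed XML document describing books is as follows: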
<bo>
<b c="COOKING">
  <t l="en">Everyday Italian</t>
  <a>Giada De Laurentiis</a>
  <y>2005</y>
  <p>30.00</p>
</b>
<b c="CHILDREN">
  <t l="en">Harry Potter</t>
  <a>J K.Rowling</a>
  <y>2005</y>
  <p>29.99</p>
</b>
<b c="WEB">
  <t l="en">Learning XML</t>
  <a>Erik T.Ray</a>
  <y>2003</y>
  <p>39.95</p>
</b>
</bo>
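Here, b, l, y, p, t, a, c, and bo are all mapping codes.
It should be noted that if the compressed XML document was further compressed with a preset text compression method (such as GZIP) before transmission and/or storage, then after it is obtained it can be decompressed according to that text compression method (such as GZIP).
Step 302: look up the mapping relationship between the mapping codes and the document parameters of the compressed XML document.
If the mapping relationship is embedded in the XML document (for example, at its head), it can be read from the XML document (for example, from its head).
For example, for the compressed XML document above, the mapping relationship embedded in the XML document is as follows: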
<!--
book<->b
lang<->l
year<->y
price<->p
title<->t
author<->a
category<->c
bookstore<->bo
-->
<bo>
<b c="COOKING">
  <t l="en">Everyday Italian</t>
  <a>Giada De Laurentiis</a>
  <y>2005</y>
  <p>30.00</p>
</b>
<b c="CHILDREN">
  <t l="en">Harry Potter</t>
  <a>J K.Rowling</a>
  <y>2005</y>
  <p>29.99</p>
</b>
<b c="WEB">
  <t l="en">Learning XML</t>
  <a>Erik T.Ray</a>
  <y>2003</y>
  <p>39.95</p>
</b>
</bo>
Here, the mapping relationship is embedded as a comment at the head of the XML document and includes:
book<->b
lang<->l
year<->y
price<->p
title<->t
author<->a
category<->c
bookstore<->bo
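Of course, if the mapping relationship is stored in another way, for example recorded in a separate file, it can be read in the corresponding way, for example by locating that separate file; the embodiments of the present application do not limit this.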
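Reading an embedded mapping relationship back out of the document head can be sketched with a regular expression. The function name `read_mapping` is hypothetical, and the sketch assumes the mapping is the first thing in the document, written one "name<->code" pair per line inside a single comment.

```python
import re

def read_mapping(document):
    """Return ({code: name}, body) parsed from a leading comment of 'name<->code' lines."""
    match = re.match(r"\s*<!--(.*?)-->\s*", document, flags=re.DOTALL)
    if match is None:
        return {}, document            # no embedded mapping found
    code_to_name = {}
    for line in match.group(1).splitlines():
        line = line.strip()
        if line:
            name, code = line.split("<->")
            code_to_name[code] = name  # keyed by code for the reverse lookup
    return code_to_name, document[match.end():]

doc = "<!--\nbook<->b\nbookstore<->bo\n-->\n<bo><b/></bo>"
print(read_mapping(doc))
# ({'b': 'book', 'bo': 'bookstore'}, '<bo><b/></bo>')
```

The returned body no longer contains the comment, so the embedded mapping relationship is removed in the same pass.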
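Step 303: replace the mapping codes with the document parameters according to the mapping relationship to obtain the original XML document.
In a specific implementation, the document parameter corresponding to each mapping code in the compressed XML document can be looked up according to the form in which the mapping relationship is recorded, and the mapping code is replaced with that document parameter to restore the document.
For example, for the compressed XML document above, the mapping relationship is recorded in the form "key<->value", where key is the document parameter and value is the mapping code.
To restore mapping code b, the mapping relationship whose value is b, namely book<->b, is found, and mapping code b is replaced with its key, book.
The restored XML document is then as follows: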
<bookstore>
<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>
<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K.Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
<book category="WEB">
  <title lang="en">Learning XML</title>
  <author>Erik T.Ray</author>
  <year>2003</year>
  <price>39.95</price>
</book>
</bookstore>
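It should be noted that when the mapping relationship is embedded in the compressed XML document, the mapping relationship is deleted, so that the original XML document is finally obtained for normal use.
The embodiments of the present application replace the mapping codes with the original document parameters by means of the mapping relationship; the mapping operation is simple, which reduces the time consumed by decompression.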
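The restoration in step 303 mirrors the compression step and can be sketched the same way. The function name `decompress` is hypothetical, the code-to-name dict is assumed to have been obtained already, and namespaces and comments are not handled.

```python
import xml.etree.ElementTree as ET

def decompress(compressed_xml, code_to_name):
    """Replace every mapping code with the original element or attribute name."""
    root = ET.fromstring(compressed_xml)
    for elem in root.iter():
        elem.tag = code_to_name[elem.tag]
        # Rebuild the attributes with restored names; attribute values are untouched.
        restored = {code_to_name[code]: value for code, value in elem.attrib.items()}
        elem.attrib.clear()
        elem.attrib.update(restored)
    return ET.tostring(root, encoding="unicode")

code_to_name = {"bo": "bookstore", "b": "book", "c": "category", "t": "title", "l": "lang"}
packed = '<bo><b c="WEB"><t l="en">Learning XML</t></b></bo>'
print(decompress(packed, code_to_name))
# <bookstore><book category="WEB"><title lang="en">Learning XML</title></book></bookstore>
```

Because each code maps to exactly one name, the lookup is a single dictionary access per tag or attribute.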
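It should be noted that, for simplicity of description, the method embodiments are expressed as a series of combined actions, but those skilled in the art should understand that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.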
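Referring to FIG. 4, a structural block diagram of an embodiment of a compression apparatus for an XML document according to the present application is shown, which may specifically include the following modules:
a document parameter reading module 401, configured to read document parameters from an original XML document;
a mapping module 402, configured to map the document parameters to mapping codes;
a document parameter replacement module 403, configured to replace the document parameters with the mapping codes to obtain a compressed XML document.
In an embodiment of the present application, the apparatus may further include the following module:
a mapping relationship embedding module, configured to embed the mapping relationship between the document parameters and the mapping codes in the XML document.
In an embodiment of the present application, the apparatus may further include the following modules:
a transmission module, configured to transmit the compressed XML document;
and/or,
a storage module, configured to store the compressed XML document.
In a specific implementation, the document parameters may include elements and/or attributes.
In an embodiment of the present application, the mapping module 402 may include the following submodules:
a deduplication submodule, configured to deduplicate the document parameters;
a deduplicated-mapping submodule, configured to map each deduplicated document parameter to a unique mapping code, wherein the string length of the mapping code is less than or equal to the string length of the document parameter.
In one example of the present application, the deduplicated-mapping submodule may include the following units:
a candidate string extraction unit, configured to extract a candidate string from the deduplicated document parameter;
a mapping code determination unit, configured to determine whether the candidate string is identical to an already assigned mapping code; if not, to invoke the mapping code confirmation unit; if so, to invoke the target string extraction unit and then return to the mapping code determination unit;
a mapping code confirmation unit, configured to confirm the candidate string as the mapping code of the document parameter;
a target string extraction unit, configured to extract from the document parameter a target string that contains the candidate string and take it as the new candidate string.
In another example of the present application, the deduplicated-mapping submodule may further include the following unit:
a sorting unit, configured to sort the deduplicated document parameters by string length and/or character order.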
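Referring to FIG. 5, a structural block diagram of an embodiment of a decompression apparatus for an XML document according to the present application is shown, which may specifically include the following modules:
an XML document obtaining module 501, configured to obtain a compressed XML document, the compressed XML document including mapping codes;
a mapping relationship lookup module 502, configured to look up the mapping relationship between the mapping codes and the document parameters of the compressed XML document;
a mapping code replacement module 503, configured to replace the mapping codes with the document parameters according to the mapping relationship to obtain the original XML document.
In an embodiment of the present application, the apparatus may further include the following module:
a mapping relationship deletion module, configured to delete the mapping relationship when the mapping relationship is embedded in the compressed XML document.
In an embodiment of the present application, the XML document obtaining module 501 may include the following submodules:
an XML document reading submodule, configured to read a previously stored compressed XML document;
or,
an XML document receiving submodule, configured to receive an incoming compressed XML document.
Because the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly; for related details, refer to the description of the method embodiments.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to one another.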
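Those skilled in the art should understand that the embodiments of the present application can be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
In a typical configuration, the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer readable medium. Computer readable media include permanent and non-persistent, removable and non-removable media, and information storage can be implemented by any method or technology. The information can be computer readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic tape cartridges, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible to a computing device. As defined herein, computer readable media do not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
The embodiments of the present application are described with reference to flowcharts and/or block diagrams of the method, the terminal device (system), and the computer program product according to the embodiments of the present application. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising an instruction device, and the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device, so that a series of operational steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Although preferred embodiments of the embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.
Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that comprises the element.
The compression method, decompression method, compression apparatus, and decompression apparatus for an XML document provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help in understanding the method of the present application and its core idea. At the same time, those of ordinary skill in the art may, based on the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.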

Claims (17)

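  1. A compression method for an Extensible Markup Language (XML) document, characterized by comprising:
    reading document parameters from an original XML document;
    mapping the document parameters to mapping codes;
    replacing the document parameters with the mapping codes to obtain a compressed XML document.
  2. The method according to claim 1, characterized by further comprising:
    embedding the mapping relationship between the document parameters and the mapping codes in the XML document.
  3. The method according to claim 1 or 2, characterized by further comprising:
    transmitting and/or storing the compressed XML document.
  4. The method according to claim 1, characterized in that the document parameters comprise elements and/or attributes.
  5. The method according to claim 1, 2, or 4, characterized in that the step of mapping the document parameters to mapping codes comprises:
    deduplicating the document parameters;
    mapping each deduplicated document parameter to a unique mapping code, wherein the string length of the mapping code is less than or equal to the string length of the document parameter.
  6. The method according to claim 5, characterized in that the step of mapping each deduplicated document parameter to a unique mapping code comprises:
    extracting a candidate string from the deduplicated document parameter;
    determining whether the candidate string is identical to an already assigned mapping code;
    if not, confirming the candidate string as the mapping code of the document parameter;
    if so, extracting from the document parameter a target string that contains the candidate string, taking it as the new candidate string, and returning to the step of determining whether the candidate string is identical to an already assigned mapping code.
  7. The method according to claim 6, characterized in that the step of mapping each deduplicated document parameter to a unique mapping code further comprises:
    sorting the deduplicated document parameters by string length and/or character order.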
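  8. A decompression method for an Extensible Markup Language (XML) document, characterized by comprising:
    obtaining a compressed XML document, the compressed XML document including mapping codes;
    looking up the mapping relationship between the mapping codes and the document parameters of the compressed XML document;
    replacing the mapping codes with the document parameters according to the mapping relationship to obtain the original XML document.
  9. The method according to claim 8, characterized by further comprising:
    deleting the mapping relationship when the mapping relationship is embedded in the compressed XML document.
  10. The method according to claim 8 or 9, characterized in that the step of obtaining the compressed XML document comprises:
    reading a previously stored compressed XML document;
    or,
    receiving an incoming compressed XML document.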
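  11. A compression apparatus for an Extensible Markup Language (XML) document, characterized by comprising:
    a document parameter reading module, configured to read document parameters from an original XML document;
    a mapping module, configured to map the document parameters to mapping codes;
    a document parameter replacement module, configured to replace the document parameters with the mapping codes to obtain a compressed XML document.
  12. The apparatus according to claim 11, characterized by further comprising:
    a mapping relationship embedding module, configured to embed the mapping relationship between the document parameters and the mapping codes in the XML document.
  13. The apparatus according to claim 11 or 12, characterized in that the mapping module comprises:
    a deduplication submodule, configured to deduplicate the document parameters;
    a deduplicated-mapping submodule, configured to map each deduplicated document parameter to a unique mapping code, wherein the string length of the mapping code is less than or equal to the string length of the document parameter.
  14. The apparatus according to claim 13, characterized in that the deduplicated-mapping submodule comprises:
    a candidate string extraction unit, configured to extract a candidate string from the deduplicated document parameter;
    a mapping code determination unit, configured to determine whether the candidate string is identical to an already assigned mapping code; if not, to invoke the mapping code confirmation unit; if so, to invoke the target string extraction unit and then return to the mapping code determination unit;
    a mapping code confirmation unit, configured to confirm the candidate string as the mapping code of the document parameter;
    a target string extraction unit, configured to extract from the document parameter a target string that contains the candidate string and take it as the new candidate string.
  15. The apparatus according to claim 14, characterized in that the deduplicated-mapping submodule further comprises:
    a sorting unit, configured to sort the deduplicated document parameters by string length and/or character order.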
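  16. A decompression apparatus for an Extensible Markup Language (XML) document, characterized by comprising:
    an XML document obtaining module, configured to obtain a compressed XML document, the compressed XML document including mapping codes;
    a mapping relationship lookup module, configured to look up the mapping relationship between the mapping codes and the document parameters of the compressed XML document;
    a mapping code replacement module, configured to replace the mapping codes with the document parameters according to the mapping relationship to obtain the original XML document.
  17. The apparatus according to claim 16, characterized by further comprising:
    a mapping relationship deletion module, configured to delete the mapping relationship when the mapping relationship is embedded in the compressed XML document.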
PCT/CN2016/096790 2015-09-06 2016-08-25 一种可扩展标记语言xml文档的压缩、解压方法和装置 WO2017036348A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510561440.9A CN106503003A (zh) 2015-09-06 2015-09-06 一种可扩展标记语言xml文档的压缩、解压方法和装置
CN201510561440.9 2015-09-06

Publications (1)

Publication Number Publication Date
WO2017036348A1 true WO2017036348A1 (zh) 2017-03-09

Family

ID=58186698

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/096790 WO2017036348A1 (zh) 2015-09-06 2016-08-25 一种可扩展标记语言xml文档的压缩、解压方法和装置

Country Status (2)

Country Link
CN (1) CN106503003A (zh)
WO (1) WO2017036348A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797596A (zh) * 2020-05-18 2020-10-20 冠群信息技术(南京)有限公司 一种可扩展标记语言xml文档的压缩、解压方法和装置

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271247A (zh) * 2017-07-12 2019-01-25 珠海市魅族科技有限公司 内存优化方法、装置、计算机装置以及存储介质
CN108256017B (zh) * 2018-01-08 2020-12-15 武汉斗鱼网络科技有限公司 一种用于数据存储的方法、装置及计算机设备
CN108233942B (zh) * 2018-01-08 2022-02-22 武汉斗鱼网络科技有限公司 一种用于数据存储的方法、装置及计算机设备
CN112214461B (zh) * 2020-10-12 2022-09-30 河南大学 一种遥感元数据的模糊xml压缩方法
CN113111290A (zh) * 2021-04-29 2021-07-13 北京房江湖科技有限公司 用于生成ui界面的方法和装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6883137B1 (en) * 2000-04-17 2005-04-19 International Business Machines Corporation System and method for schema-driven compression of extensible mark-up language (XML) documents
CN1635492A (zh) * 2003-12-30 2005-07-06 皇家飞利浦电子股份有限公司 一种xml数据的压缩与解压缩方法及装置
CN102096704A (zh) * 2010-12-29 2011-06-15 北京新媒传信科技有限公司 一种xml的压缩方法和装置
CN103605730A (zh) * 2013-11-19 2014-02-26 山西三恒自动化设备有限公司 一种基于不定长标识码的xml的压缩方法和装置
CN104484337A (zh) * 2014-11-19 2015-04-01 西安电子科技大学 Xml文档的存储方法

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020099712A1 (en) * 2001-01-23 2002-07-25 Neo-Core, L.L.C. Method of operating an extensible markup language database
CN1492322A (zh) * 2003-08-20 2004-04-28 放 黄 xml数据压缩和解压方法
US7171430B2 (en) * 2003-08-28 2007-01-30 International Business Machines Corporation Method and system for processing structured documents in a native database
CN101222476B (zh) * 2007-01-08 2010-09-29 华为技术有限公司 一种可扩展标记语言文件编辑器、文件传输方法及系统
US9058344B2 (en) * 2013-01-31 2015-06-16 International Business Machines Corporation Supporting flexible types in a database
CN104268143B (zh) * 2014-08-08 2017-10-20 华迪计算机集团有限公司 Xml数据的处理方法和装置

Also Published As

Publication number Publication date
CN106503003A (zh) 2017-03-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 16840777; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 16840777; Country of ref document: EP; Kind code of ref document: A1)