ParaXML: A Parallel XML Processing Model on the Multicore CPUs
Abstract - Cited by 3 (1 self)
XML has emerged as the de facto standard interoperable data format for web services, databases, and document processing systems. The processing of XML documents, however, has been recognized as the performance bottleneck in those systems; as a result, the demand for high-performance XML processing is growing rapidly. On the hardware front, multicore processors are increasingly available on desktop machines, with quad-core systems shipping now and 16-core systems expected within two or three years. Unfortunately, almost all present XML processing algorithms still use a serial processing model and thus cannot take advantage of multicore resources. We believe a parallel XML processing model should be a cost-effective solution to the XML performance issue in the multicore era. In this paper, we present a general-purpose parallel XML processing model, ParaXML, designed for multicore CPUs. Generally speaking, ParaXML treats the XML document as a general tree structure and the XML processing task as an extension of the parallel tree traversal algorithms for classic discrete optimization problems. XML processing, however, has characteristics quite distinct from those of classic discrete optimization problems and thus demands special treatment and fine-grained tuning techniques. ParaXML internally adopts a fine-grained work-stealing scheme to dynamically balance the load among the parallel-running threads, and a novel approach is introduced to trace the stealing actions and the running results to facilitate the reduction of those parallel results. In addition, ParaXML provides tuning options, particularly for large XML documents, to control the trade-off between the parallelism gain and the task-partitioning overhead.
To show the feasibility and effectiveness of the ParaXML model, we demonstrate our parallel implementations of three fundamental XML processing tasks based on ParaXML: traversal, serialization, and parsing. The empirical study in this paper shows that those parallel implementations substantially improve performance and scale well on a multicore machine.
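The work-stealing traversal described in this abstract can be illustrated with a minimal Python sketch. This is not ParaXML's implementation: the function name `parallel_count`, the per-worker deques, and the `split_depth` parameter (standing in for the tuning option that trades parallelism against task-partitioning overhead) are all assumptions made for illustration. Each worker takes its newest task from its own deque, steals the oldest task from another worker when idle, and the per-worker partial counts are reduced at the end.

```python
import threading
import time
import xml.etree.ElementTree as ET
from collections import deque


def parallel_count(root, n_workers=4, split_depth=3):
    """Count elements under root with a toy work-stealing traversal.

    split_depth is a hypothetical tuning knob: subtrees at or below that
    depth are traversed serially instead of being split into more tasks.
    """
    deques = [deque() for _ in range(n_workers)]
    deques[0].append((root, 0))
    counts = [0] * n_workers   # per-worker partial results, reduced at the end
    pending = [1]              # tasks that are queued or being processed
    lock = threading.Lock()

    def take(me):
        try:
            return deques[me].pop()                 # own work: newest first
        except IndexError:
            pass
        for victim in range(n_workers):
            if victim != me:
                try:
                    return deques[victim].popleft()  # steal: oldest first
                except IndexError:
                    pass
        return None

    def worker(me):
        while True:
            task = take(me)
            if task is None:
                with lock:
                    if pending[0] == 0:
                        return                       # nothing left anywhere
                time.sleep(0.0005)                   # idle briefly, then retry
                continue
            node, depth = task
            if depth >= split_depth:
                counts[me] += sum(1 for _ in node.iter())  # serial subtree
                children = []
            else:
                counts[me] += 1
                children = [(child, depth + 1) for child in node]
            with lock:
                pending[0] += len(children) - 1
            deques[me].extend(children)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts)
```

Because of CPython's global interpreter lock, this sketch shows the scheduling pattern rather than a real speedup; a lower `split_depth` creates fewer, larger tasks, and a higher one creates more, smaller tasks.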
Parallel Processing of Large-Scale XML-Based Application Documents on Multi-core Architectures with PiXiMaL
Abstract - Cited by 2 (2 self)
Very large scientific datasets are becoming increasingly available in XML formats. Our earlier benchmarking results show that parsing XML is a time-consuming process when compared with binary formats optimized for large-scale documents. This performance bottleneck will be exacerbated as the size of XML data increases in e-science applications. Our focus in this paper is on addressing this performance bottleneck. In recent times, the microprocessor industry has made rapid strides towards Chip Multi Processors (CMPs). The widely available XML parsers have been unable to take advantage of the opportunities presented by CMPs, instead passing the task of parallelization to the application programmer. The paradigms used thus far to process large XML documents on uni-processors are not applicable to CMPs. We present the design, implementation, and performance analysis of PiXiMaL, a parallel processing library for large-scale XML data files. In particular, we discuss an effective scheme to parallelize the tokenization process to achieve an overall performance increase when parsing the large-scale XML documents that are increasingly in use today. Our approach is to build a DFA-based parser that recognizes a useful subset of the XML specification and to convert the DFA into an NFA that can be applied to any subset of the input.
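One way to read the DFA-to-NFA idea in this abstract is that a chunk taken from the middle of a document has an unknown start state, so the automaton is run from every possible state at once. The Python sketch below illustrates that reading under stated assumptions; it is not PiXiMaL's code. The three-state automaton (`CONTENT`, `TAG`, `QUOTE`) and the function names are hypothetical and cover only a toy subset of XML. Each chunk yields a mapping from start state to end state, and the mappings are composed in document order.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical toy automaton: outside a tag, inside a tag, inside a quoted value.
CONTENT, TAG, QUOTE = 0, 1, 2
STATES = (CONTENT, TAG, QUOTE)


def step(state, ch):
    """One DFA transition."""
    if state == CONTENT:
        return TAG if ch == "<" else CONTENT
    if state == TAG:
        if ch == ">":
            return CONTENT
        return QUOTE if ch == '"' else TAG
    return TAG if ch == '"' else QUOTE


def run(state, text):
    """Ordinary serial DFA run."""
    for ch in text:
        state = step(state, ch)
    return state


def chunk_map(chunk):
    """Run the chunk from every start state: entry s holds the end state for s."""
    return tuple(run(s, chunk) for s in STATES)


def compose(first, second):
    """Mapping for 'first chunk, then second chunk'."""
    return tuple(second[first[s]] for s in STATES)


def final_state(text, n_chunks):
    """End state for the whole text, computed chunk by chunk."""
    size = max(1, -(-len(text) // n_chunks))   # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with ThreadPoolExecutor() as pool:
        maps = list(pool.map(chunk_map, chunks))
    # STATES doubles as the identity mapping; the document starts in CONTENT.
    return reduce(compose, maps, STATES)[CONTENT]
```

The per-chunk work is independent, which is what makes the tokenization step parallelizable; the cost is running each chunk once per state instead of once.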
Approaching a Parallelized XML Parser Optimized for Multi-Core Processors
Abstract - Cited by 1 (0 self)
Very large scientific datasets are increasingly becoming available in XML formats. At the same time, multi-core processing is increasingly becoming available on desktop- and laptop-class computing machines. Unfortunately, most XML parsers still use algorithms that are inherently serial and show little improvement on newer computing hardware. The current XML implementation landscape does not adequately meet the performance requirements of large-scale applications. Thus far, applications using Web services (in the grid community, for example) have largely focused on XML protocol standardization and tool-building efforts, and not on addressing the performance bottlenecks that arise when dealing with large volumes of XML data. Generic parallel parsing has been studied in depth over the past thirty years. However, as yet, these results have not been applied to the problem of XML parsing. XML documents have some structural properties that make them more amenable to parallelized parsing than general context-free languages. As has been previously shown, XML parsers spend a large percentage of their time tokenizing the input in an inherently serial process, typically running a deterministic finite automaton on the input. Our initial approach, described here, separates the process of parsing the XML from the process of reading the input. We take a well-known high-performance parser, Piccolo, and apply two different strategies, Runahead and Piped, and examine the file read time and hence the overall time to parse large scientific XML files. Under the conditions tested here, performance decreases.
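The separation of reading from parsing can be sketched as a two-thread pipeline. The example below is only an illustration of the general idea: it uses Python's standard `xml.etree.ElementTree.XMLPullParser` rather than Piccolo, and the function name `piped_count`, the chunk size, and the queue depth are assumptions, not details from the paper. A reader thread fills a bounded queue with raw chunks while the main thread feeds them to an incremental parser. As the abstract notes, this kind of split did not improve performance under the conditions the authors tested.

```python
import io
import queue
import threading
import xml.etree.ElementTree as ET


def piped_count(stream, chunk_size=64 * 1024, depth=8):
    """Count start tags while a separate thread reads the input.

    chunk_size and depth (queue capacity) are hypothetical tuning values.
    """
    chunks = queue.Queue(maxsize=depth)

    def reader():
        # Producer: read ahead of the parser until end of input.
        while True:
            data = stream.read(chunk_size)
            chunks.put(data)        # an empty read doubles as the end marker
            if not data:
                return

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    parser = ET.XMLPullParser(events=("start",))
    count = 0
    while True:
        data = chunks.get()         # consumer: blocks until a chunk is ready
        if not data:
            break
        parser.feed(data)
        count += sum(1 for _ in parser.read_events())
    parser.close()
    count += sum(1 for _ in parser.read_events())  # events flushed by close()
    thread.join()
    return count
```

For example, `piped_count(open("data.xml", "rb"))` would count the elements in a file named `data.xml` (a placeholder name).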
A Data Parallel Algorithm for XML DOM Parsing
Abstract
The extensible markup language XML has become the de facto standard for information representation and interchange on the Internet. XML parsing is a core operation performed on an XML document so that it can be accessed and manipulated. This operation is known to cause performance bottlenecks in applications and systems that process large volumes of XML data. We believe that parallelism is a natural way to boost performance. Leveraging multicore processors can offer a cost-effective solution, because future multicore processors will support hundreds of cores and will offer a high degree of parallelism in hardware. We propose a data parallel algorithm called ParDOM for XML DOM parsing, which builds an in-memory tree structure for an XML document. ParDOM has two phases. In the first phase, an XML document is partitioned into chunks, which are parsed in parallel. In the second phase, the partial DOM node tree structures created during the first phase are linked together (in parallel) to build a complete DOM node tree. ParDOM offers fine-grained parallelism by adopting a flexible chunking scheme: each chunk can contain an arbitrary number of start and end XML tags that are not necessarily matched. ParDOM can be conveniently implemented using a data parallel programming model that supports map and sort operations. Through empirical evaluation, we show that ParDOM yields better scalability than PXP [23], a recently proposed parallel DOM parsing algorithm, on commodity multicore processors. Furthermore, ParDOM can process a wide variety of XML datasets with complex structures that PXP fails to parse.
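The two-phase scheme can be sketched in Python as follows. This is a simplified illustration, not the ParDOM algorithm itself: the function names are hypothetical, text and attributes are ignored, self-closing tags and comments are assumed absent, chunk boundaries are moved to the next `<` so that no tag is split, and the linking phase here runs sequentially, whereas the abstract states that ParDOM links in parallel. Phase one parses each chunk independently and records unmatched end tags and still-open start tags; phase two stitches the partial trees together.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Matches a start or end tag; attributes are skipped, not parsed.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*>")


def split_chunks(doc, n):
    """Cut doc into at most n pieces, moving each cut forward to the next '<'."""
    cuts = [0]
    for i in range(1, n):
        pos = doc.find("<", max(cuts[-1], i * len(doc) // n))
        if pos > cuts[-1]:
            cuts.append(pos)
    cuts.append(len(doc))
    return [doc[a:b] for a, b in zip(cuts, cuts[1:])]


def parse_chunk(chunk):
    """Phase 1: build partial trees; tags need not be matched inside the chunk."""
    items, stack = [], []
    for m in TAG.finditer(chunk):
        closing, name = m.group(1), m.group(2)
        if closing:
            if stack:
                stack.pop()
            else:
                items.append(("close", name))   # closes an element opened earlier
        else:
            node = {"name": name, "children": []}
            if stack:
                stack[-1]["children"].append(node)
            else:
                items.append(("node", node))    # parent lies in an earlier chunk
            stack.append(node)
    return items, stack                         # stack: elements still open


def link(partials):
    """Phase 2: attach each chunk's top-level nodes to the elements still open."""
    root, open_nodes = None, []
    for items, still_open in partials:
        for kind, value in items:
            if kind == "close":
                open_nodes.pop()
            elif open_nodes:
                open_nodes[-1]["children"].append(value)
            else:
                root = value
        open_nodes.extend(still_open)
    return root


def parse_parallel(doc, n_chunks=4):
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(parse_chunk, split_chunks(doc, n_chunks)))
    return link(partials)
```

With the input `<a><b><c></c></b><d></d></a>` and three chunks, the pieces are `<a><b><c>`, `</c></b><d>`, and `</d></a>`; none of them is well formed on its own, yet linking recovers the tree in which `a` has children `b` and `d`, and `b` has child `c`.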

