Results 1 - 10 of 250
Language Virtualization for Heterogeneous Parallel Computing
Cited by 35 (10 self)
As heterogeneous parallel systems become dominant, application developers are being forced to turn to an incompatible mix of low-level programming models (e.g. OpenMP, MPI, CUDA, OpenCL). However, these models do little to shield developers from the difficult problems of parallelization, data decomposition and machine-specific details. Most programmers have a difficult time using these programming models effectively. To provide a programming model that addresses the productivity and performance requirements of the average programmer, we explore a domain-specific approach to heterogeneous parallel programming. We propose language virtualization as a new principle that enables the construction of highly efficient parallel domain-specific languages that are embedded in a common host language. We define criteria for language virtualization and present techniques to achieve them. We present two concrete case studies of domain-specific languages implemented using our virtualization approach.
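The core mechanism behind an embedded DSL of this kind is staging: DSL expressions build an intermediate representation inside the host language instead of evaluating eagerly, so a backend can analyze and optimize the whole program before running it. The sketch below illustrates only that generic deep-embedding idea in Python; all class and function names are invented for illustration and are not from the paper.

```python
# Deep embedding: host-language operators construct an IR tree.
class Expr:
    def __add__(self, other): return BinOp("+", self, other)
    def __mul__(self, other): return BinOp("*", self, other)

class Const(Expr):
    def __init__(self, value): self.value = value

class Var(Expr):
    def __init__(self, name): self.name = name

class BinOp(Expr):
    def __init__(self, op, lhs, rhs):
        self.op, self.lhs, self.rhs = op, lhs, rhs

def evaluate(expr, env):
    """A trivial backend that walks the staged IR and computes a value.
    A real system would instead analyze the IR and generate parallel code."""
    if isinstance(expr, Const): return expr.value
    if isinstance(expr, Var):   return env[expr.name]
    l, r = evaluate(expr.lhs, env), evaluate(expr.rhs, env)
    return l + r if expr.op == "+" else l * r

# Looks like ordinary host-language arithmetic, but builds an IR
# (BinOp("+", BinOp("*", Var, Const), Const)) rather than a number.
program = Var("x") * Const(3) + Const(1)
print(evaluate(program, {"x": 4}))  # 13
```

Because the program exists as data before it runs, a backend is free to fuse, parallelize, or retarget it, which is the property the virtualization criteria aim to preserve.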
Scalable concurrent hash tables via relativistic programming
- Operating Systems Review
, 2010
Cited by 29 (12 self)
We present algorithms for shrinking and expanding a hash table while allowing concurrent, wait-free, linearly scalable lookups. These resize algorithms allow Read-Copy Update (RCU) hash tables to maintain constant-time performance as the number of entries grows, and to reclaim memory as the number of entries decreases, without delaying or disrupting readers. We call the resulting data structure a relativistic hash table. Benchmarks of relativistic hash tables in the Linux kernel show that lookup scalability during resize improves 125x over reader-writer locking, and 56% over Linux’s current state of the art. Relativistic hash lookups experience no performance degradation during a resize. Applying this algorithm to memcached removes a scalability limit for get requests, allowing memcached to scale linearly and service up to 46% more requests per second. Relativistic hash tables demonstrate the promise of a new concurrent programming methodology known as relativistic programming. Relativistic programming makes novel use of existing RCU synchronization primitives, namely the wait-for-readers operation that waits for unfinished readers to complete. This operation, conventionally used to handle reclamation, here allows ordering of updates without read-side synchronization or memory barriers.
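The publication pattern underlying RCU-style tables can be sketched in miniature: readers follow a single reference to an immutable bucket array, and a resize builds a complete new array off to the side before atomically swapping the reference in, so a reader sees either the old table or the new one, never a mix. This simplified single-process Python sketch omits the paper's actual wait-for-readers machinery and per-bucket relinking; the class and method names are illustrative.

```python
import threading

class RcuStyleTable:
    def __init__(self, nbuckets=4):
        # The published state: a tuple of bucket tuples, never mutated in place.
        self._buckets = tuple(() for _ in range(nbuckets))
        self._lock = threading.Lock()  # serializes writers only
        self._count = 0

    def lookup(self, key):
        buckets = self._buckets          # one atomic read of the reference
        for k, v in buckets[hash(key) % len(buckets)]:
            if k == key:
                return v
        return None

    def insert(self, key, value):
        with self._lock:
            buckets = list(self._buckets)
            i = hash(key) % len(buckets)
            buckets[i] = buckets[i] + ((key, value),)
            self._buckets = tuple(buckets)   # publish the new state
            self._count += 1
            if self._count > 2 * len(buckets):
                self._resize(2 * len(buckets))

    def _resize(self, nbuckets):
        # Build the entire new bucket array, then publish it in one store.
        # Readers never observe a half-resized table.
        new = [[] for _ in range(nbuckets)]
        for bucket in self._buckets:
            for k, v in bucket:
                new[hash(k) % nbuckets].append((k, v))
        self._buckets = tuple(tuple(b) for b in new)

t = RcuStyleTable()
for n in range(20):
    t.insert(n, n * n)
print(t.lookup(7))  # 49
```

The copy-on-write here is far cruder than the paper's algorithm, which resizes incrementally and lets lookups proceed during the resize itself; the sketch only shows why readers need no locks or memory barriers on this style of structure.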
NZTM: Nonblocking zero-indirection transactional memory
- In Workshop on Transactional Computing (TRANSACT)
, 2007
Cited by 28 (3 self)
This workshop paper reports work in progress on NZTM, a nonblocking, zero-indirection, object-based hybrid transactional memory system. NZTM can execute transactions using best-effort hardware transactional memory when it is available and effective, but otherwise executes transactions using NZSTM, our compatible software transactional memory system. Previous nonblocking software and hybrid transactional memory implementations pay a significant performance cost in the common case, as compared to simpler, blocking ones. However, blocking is problematic in some cases and unacceptable in others. NZTM is nonblocking, but shares the advantages of recent blocking STM proposals in the common case: it stores object data “in place”, thus avoiding the costly levels of indirection in previous nonblocking STMs, and it improves cache performance by collocating object metadata with the data it controls.
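To make the STM vocabulary in this abstract concrete, here is a toy transactional memory in Python showing only the generic "record read versions, validate at commit" scheme that object-based STMs build on. It is deliberately blocking and does not reproduce NZTM's nonblocking ownership protocol or its in-place object layout; all names below are invented for illustration.

```python
import threading

class TVar:
    """A transactional variable: data stored next to a version number,
    loosely echoing the idea of collocating metadata with the data."""
    def __init__(self, value):
        self.value, self.version = value, 0

_commit_lock = threading.Lock()  # a blocking simplification, unlike NZTM

class Transaction:
    def __init__(self):
        self.reads = {}    # TVar -> version observed at first read
        self.writes = {}   # TVar -> buffered new value

    def read(self, tvar):
        if tvar in self.writes:          # read-your-own-writes
            return self.writes[tvar]
        self.reads.setdefault(tvar, tvar.version)
        return tvar.value

    def write(self, tvar, value):
        self.writes[tvar] = value        # buffered until commit

    def commit(self):
        with _commit_lock:
            # Validate: abort if anything we read changed concurrently.
            for tvar, seen in self.reads.items():
                if tvar.version != seen:
                    return False
            for tvar, value in self.writes.items():
                tvar.value = value
                tvar.version += 1
            return True

def atomically(fn, retries=10):
    for _ in range(retries):
        tx = Transaction()
        result = fn(tx)
        if tx.commit():
            return result
    raise RuntimeError("transaction kept aborting")

a, b = TVar(100), TVar(0)
# Atomic transfer: either both updates commit or neither does.
atomically(lambda tx: (tx.write(a, tx.read(a) - 30),
                       tx.write(b, tx.read(b) + 30)))
print(a.value, b.value)  # 70 30
```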
OptiML: an implicitly parallel domainspecific language for machine learning
- in Proceedings of the 28th International Conference on Machine Learning, ser. ICML
, 2011
Cited by 28 (11 self)
As the size of datasets continues to grow, machine learning applications are becoming increasingly limited by the amount of available computational power. Taking advantage of modern hardware requires using multiple parallel programming models targeted at different devices (e.g. CPUs and GPUs). However, programming these devices to run efficiently and correctly is difficult and error-prone, and results in software that is harder to read and maintain. We present OptiML, a domain-specific language (DSL) for machine learning. OptiML is an implicitly parallel, expressive and high-performance alternative to MATLAB and C++. OptiML performs domain-specific analyses and optimizations and automatically generates CUDA code for GPUs. We show that OptiML outperforms explicitly parallelized MATLAB code in nearly all cases.
A parallel approach to XML parsing
- In The 7th IEEE/ACM International Conference on Grid Computing
, 2006
Cited by 26 (4 self)
A language for semi-structured documents, XML has emerged as the core of the web services architecture, and is playing crucial roles in messaging systems, databases, and document processing. However, the processing of XML documents has a reputation for poor performance, and a number of optimizations have been developed to address this performance problem from different perspectives, none of which have been entirely satisfactory. In this paper, we present a seemingly quixotic, but novel approach: parallel XML parsing. Parallel XML parsing leverages the growing prevalence of multicore architectures in all sectors of the computer market, and yields significant performance improvements. This paper presents our design and implementation of parallel XML parsing. Our design consists of an initial preparsing phase to determine the structure of the XML document, followed by a full, parallel parse. The results of the preparsing phase are used to help partition the XML document for data-parallel processing. Our parallel parsing phase is a modification of the libxml2 [1] XML parser, which shows that our approach applies to real-world, production-quality parsers. Our empirical study shows that our parallel XML parsing algorithm can improve XML parsing performance significantly and scales well.
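The preparse-then-parallel-parse pipeline can be illustrated in miniature: a cheap scan locates the boundaries of the root's child elements, the document is split into chunks along those boundaries, and each chunk is parsed independently. The Python sketch below assumes a flat document whose records do not nest, and it stands in for the real preparser with a regex; the paper's actual system modifies libxml2 and handles arbitrary structure.

```python
import re
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# A flat document of 100 independent records (values are i * 2).
doc = "<records>" + "".join(
    f"<rec id='{i}'><val>{i * 2}</val></rec>" for i in range(100)
) + "</records>"

# "Preparsing" phase: find top-level record boundaries without a full parse.
# (A regex stands in for the skeleton scan; it works only because records
# never nest in this toy document.)
chunks = re.findall(r"<rec .*?</rec>", doc)

def parse_chunk(chunk):
    """Fully parse one partition of records under a synthetic root."""
    root = ET.fromstring("<part>" + "".join(chunk) + "</part>")
    return [int(rec.find("val").text) for rec in root]

# Partition the records across workers and parse the chunks in parallel.
nworkers = 4
groups = [chunks[i::nworkers] for i in range(nworkers)]
with ThreadPoolExecutor(max_workers=nworkers) as pool:
    parsed = list(pool.map(parse_chunk, groups))

total = sum(sum(vals) for vals in parsed)
print(total)  # 9900
```

The essential point the sketch preserves is that the expensive full parse only begins after the cheap structural pass has produced safe, independent partitions.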
Reusable aspect-oriented implementations of concurrency patterns and mechanisms
- In AOSD ’06: Proceedings of the 5th international conference on Aspect-oriented software development
, 2006
Cited by 22 (3 self)
Concurrent programming is more complex than sequential programming, requiring an additional effort from programmers to manage the resulting increase in complexity. Concurrency requires the specification of actions that can occur simultaneously and the prevention of undesirable interactions between these concurrent actions. The lack of suitable software abstractions for specifying concurrent behaviour makes concurrent software harder to develop and maintain, and reduces its reuse potential. Several high-level concurrency constructs have been proposed in the last fifteen years in Concurrent Object-Oriented Languages (COOLs). These constructs can help structure concurrent programs to manage this increase in complexity. However, current mainstream object-oriented languages, such as Java or C#, do not include many of these high-level abstractions. More recently, some of these constructs were revisited as a set of patterns, which represent recurring solutions to frequently occurring concurrency problems.
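The paper's aspects modularize a concurrency concern once and weave it into many classes; the closest lightweight analogue in Python is a decorator. Below, a "synchronized method" pattern is written once and reused across otherwise-unrelated methods, keeping the locking code out of the business logic. This is an analogy to the AspectJ implementations the paper describes, not a reproduction of them; the names are illustrative.

```python
import threading
from functools import wraps

def synchronized(method):
    """Reusable mutual-exclusion pattern: one reentrant lock per object,
    acquired around every decorated method."""
    @wraps(method)
    def wrapper(self, *args, **kwargs):
        # dict.setdefault is atomic in CPython, so lazy lock creation is safe.
        lock = self.__dict__.setdefault("_monitor_lock", threading.RLock())
        with lock:
            return method(self, *args, **kwargs)
    return wrapper

class Counter:
    """Business logic stays free of any explicit locking code."""
    def __init__(self):
        self.value = 0

    @synchronized
    def increment(self):
        self.value += 1

c = Counter()
threads = [threading.Thread(target=lambda: [c.increment() for _ in range(1000)])
           for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(c.value)  # 8000
```

As with the patterns the paper catalogues, the win is that the concurrency concern lives in exactly one place and can be applied, removed, or swapped without touching the classes it protects.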
Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures
Cited by 21 (9 self)
Scaling computations on emerging massive-core supercomputers is a daunting task, which, coupled with significantly lagging system I/O capabilities, exacerbates applications’ end-to-end performance. The I/O bottleneck often negates the potential performance benefits of assigning additional compute cores to an application. In this paper, we address this issue via a novel functional partitioning (FP) runtime environment that allocates cores to specific application tasks (checkpointing, de-duplication, and scientific data format transformation) so that the deluge of cores can be brought to bear on the entire gamut of application activities. The focus is on utilizing the extra cores to support HPC application I/O activities, and on leveraging solid-state disks in this context. For example, our evaluation shows that dedicating one core on an eight-core machine to checkpointing and its assist tasks using FP can improve the overall execution time of a FLASH benchmark on 80 and 160 cores by 43.95% and 41.34%, respectively.
Scientific Workflow Systems for 21st Century, New Bottle or New Wine
- IEEE Workshop on Scientific Workflows
, 2008
Cited by 20 (6 self)
With the advances in e-Sciences and the growing complexity of scientific analyses, more and more scientists and researchers are relying on workflow systems for process coordination, derivation automation, provenance tracking, and bookkeeping. While workflow systems have been in use for decades, it is unclear whether scientific workflows can, or even should, build on existing workflow technologies, or whether they require fundamentally new approaches. In this paper, we analyze the status and challenges of scientific workflows, investigate both existing technologies and emerging languages, platforms and systems, and identify the key challenges that must be addressed by workflow systems for e-science in the 21st century.
A static load-balancing scheme for parallel XML parsing on multicore CPUs
- In CCGrid’07 (IEEE International Symposium on Cluster Computing and the Grid), Rio de Janeiro
, 2007
Cited by 20 (4 self)
A number of techniques to improve the parsing performance of XML have been developed. Generally, however, these techniques have limited impact on the construction of a DOM tree, which can be a significant bottleneck. Meanwhile, the trend in hardware technology is toward an increasing number of cores per CPU. As we have shown in previous work, these cores can be used to parse XML in parallel, resulting in significant speedups. In this paper, we introduce a new static partitioning and load-balancing mechanism. By using a static, global approach, we reduce synchronization and load-balancing overhead, thus improving performance over dynamic schemes for a large class of XML documents. Our approach leverages libxml2 without modification, which reduces development effort and shows that our approach is applicable to real-world, production parsers. Our scheme works well with Sun’s Niagara class of CMT architectures, and shows that multiple hardware threads can be effectively used for XML parsing.
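What distinguishes a static scheme from a dynamic one is that the work assignment is computed once, up front, from the fragment sizes a preparse reports, so no runtime work-stealing or queue synchronization is needed during the parse itself. The sketch below uses a generic greedy longest-first heuristic for that up-front assignment; the paper's actual partitioner is specific to XML tree structure, so treat this as an illustration of the principle only.

```python
def static_partition(sizes, nworkers):
    """Assign item indices to workers before any work starts:
    take items largest-first, giving each to the least-loaded worker."""
    loads = [0] * nworkers
    assignment = [[] for _ in range(nworkers)]
    for idx in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        w = loads.index(min(loads))   # currently least-loaded worker
        loads[w] += sizes[idx]
        assignment[w].append(idx)
    return assignment, loads

# Hypothetical byte sizes of document fragments, as a preparse might report.
sizes = [90, 10, 40, 60, 20, 50, 30]
assignment, loads = static_partition(sizes, 3)
print(loads)  # [110, 100, 90]
```

Each worker then parses exactly its assigned fragments with no further coordination, which is where the reduced synchronization overhead over dynamic schemes comes from.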