Results 1 - 10
of
207
DBMSs on a modern processor: Where does time go
- in VLDB
, 1999
"... Recent high-performance processors employ sophisticated techniques to overlap and simultaneously execute multiple computation and memory operations. Intuitively, these techniques should help database applications, which are becoming increasingly compute and memory bound. Unfortunately, recent studie ..."
Abstract
-
Cited by 246 (26 self)
- Add to MetaCart
(Show Context)
Recent high-performance processors employ sophisticated techniques to overlap and simultaneously execute multiple computation and memory operations. Intuitively, these techniques should help database applications, which are becoming increasingly compute and memory bound. Unfortunately, recent studies report that faster processors do not improve database system performance to the same extent as scientific workloads. Recent work on database systems focusing on minimizing memory latencies, such as cache-conscious algorithms for sorting and data placement, is one step toward addressing this problem. However, to best design high performance DBMSs we must carefully evaluate and understand the processor and memory behavior of commercial DBMSs on today’s hardware platforms. In this paper we answer the question “Where does time go when a database system is executed on a modern computer platform? ” We examine four commercial DBMSs running on an Intel Xeon and NT 4.0. We introduce a framework for analyzing query execution time on a DBMS running on a server with a modern processor and memory architecture. To focus on processor and memory interactions and exclude effects from the I/O subsystem, we use a memory resident database. Using simple queries we find that database developers should (a) optimize data placement for the second level of data cache, and not the first, (b) optimize instruction placement to reduce first-level instruction cache stalls, but (c) not expect the overall execution time to decrease significantly without addressing stalls related to subtle implementation issues (e.g., branch prediction). 1
DieHard: probabilistic memory safety for unsafe languages
- in PLDI ’06
, 2006
"... Applications written in unsafe languages like C and C++ are vulnerable to memory errors such as buffer overflows, dangling pointers, and reads of uninitialized data. Such errors can lead to program crashes, security vulnerabilities, and unpredictable behavior. We present DieHard, a runtime system th ..."
Abstract
-
Cited by 188 (20 self)
- Add to MetaCart
(Show Context)
Applications written in unsafe languages like C and C++ are vulnerable to memory errors such as buffer overflows, dangling pointers, and reads of uninitialized data. Such errors can lead to program crashes, security vulnerabilities, and unpredictable behavior. We present DieHard, a runtime system that tolerates these errors while probabilistically maintaining soundness. DieHard uses randomization and replication to achieve probabilistic memory safety by approximating an infinite-sized heap. DieHard’s memory manager randomizes the location of objects in a heap that is at least twice as large as required. This algorithm prevents heap corruption and provides a probabilistic guarantee of avoiding memory errors. For additional safety, DieHard can operate in a replicated mode where multiple replicas of the same application are run simultaneously. By initializing each replica with a different random seed and requiring agreement on output, the replicated version of Die-Hard increases the likelihood of correct execution because errors are unlikely to have the same effect across all replicas. We present analytical and experimental results that show DieHard’s resilience to a wide range of memory errors, including a heap-based buffer overflow in an actual application.
Cache-conscious structure definition
- In PLDI ’99: Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
, 1999
"... A program’s cache performance can be improved by changing the organization and layout of its data-even complex, pointer-based data structures. Previous techniques improved the cache performance of these structures by arranging distinct instances to increase reference locality. These techniques produ ..."
Abstract
-
Cited by 134 (9 self)
- Add to MetaCart
A program’s cache performance can be improved by changing the organization and layout of its data-even complex, pointer-based data structures. Previous techniques improved the cache performance of these structures by arranging distinct instances to increase reference locality. These techniques produced significant performance improvements, but worked best for small structures that could be packed into a cache block. This paper extends that work by concentrating on the internal organization of fields in a data structure. It describes two techniquesstructure splitting and field reordering--that improve the cache behavior of structures larger than a cache block. For structures comparable in size to a cache block, structure splitting can increase the number of hot fields that can be placed in a cache block. In five Java programs, structure splitting reduced cache miss rates 1 O-27% and improved performance 648 % beyond the benefits of previously described cache-conscious reorganization techniques. For large structures, which span many cache blocks, reordering fields, to place those with high temporal affinity in the same cache block can also improve cache utilization. This paper describes bbcache, a tool that recommends C structure field reorderings. Preliminary measurements indicate that reordering fields in 5 active structures improves the performance of Microsoft SQL Server 7.0
The gSOAP Toolkit for Web Services and Peer-To-Peer Computing Networks
- proceedings of the 2nd IEEE International Symposium on Cluster Computing and the Grid (CCGrid2002)
, 2002
"... This paper presents the gSOAP stub and skeleton compiler. The compiler provides a unique SOAP-to-C/C++ language binding for deploying C/C++ applications in SOAP Web Services, clients, and peer-to-peer computing networks. gSOAP enables the integratation of (legacy) C/C++/Fortran codes, embedded syste ..."
Abstract
-
Cited by 122 (15 self)
- Add to MetaCart
(Show Context)
This paper presents the gSOAP stub and skeleton compiler. The compiler provides a unique SOAP-to-C/C++ language binding for deploying C/C++ applications in SOAP Web Services, clients, and peer-to-peer computing networks. gSOAP enables the integratation of (legacy) C/C++/Fortran codes, embedded systems, and real-time software in Web Services, clients, and peers that share computational resources and information with other SOAPenabled applications, possibly across different platforms, language environments, and disparate organizations located behind firewalls. Results on interoperability, legacy code integration, scalability, and performance are given.
Effective jump-pointer prefetching for linked data structures
- In Proceedings of the 26th Annual International Symposium on Computer Architecture
, 1999
"... Current techniques for prefetching linked data structures (LDS) exploit the work available in one loop iteration or recursive call to overlap pointer chasing latency. Jumppointers, which provide direct access to non-adjacent nodes, can be used for prefetching when loop and recursive procedure bodies ..."
Abstract
-
Cited by 102 (1 self)
- Add to MetaCart
(Show Context)
Current techniques for prefetching linked data structures (LDS) exploit the work available in one loop iteration or recursive call to overlap pointer chasing latency. Jumppointers, which provide direct access to non-adjacent nodes, can be used for prefetching when loop and recursive procedure bodies are small and do not have sufficient work to overlap a long latency. This paper describes a framework for jump-pointer prefetching (JPP) that supports four prefetching idioms: queue, full, chain, and root jumping and three implementations: software-only, hardware-only, and a cooperative software/hardware technique. On a suite of pointer intensive programs, jumppointer prefetching reduces memory stall time by 72 % for software, 83 % for cooperative and 55 % for hardware, producing speedups of 15%, 20 % and 22 % respectively. 1
Efficient Representation and Abstractions for Quantifying and Exploiting Data Reference Locality
, 2001
"... With the growing processor-memory performance gap, understanding and optimizing a program's reference locality, and consequently, its cache performance, is becoming increasingly important. Unfortunately, current reference locality optimizations rely on heuristics and are fairly ad-hoc. In a ..."
Abstract
-
Cited by 99 (4 self)
- Add to MetaCart
With the growing processor-memory performance gap, understanding and optimizing a program's reference locality, and consequently, its cache performance, is becoming increasingly important. Unfortunately, current reference locality optimizations rely on heuristics and are fairly ad-hoc. In addition, while optimization technology for improving instruction cache performance is fairly mature (though heuristic-based), data cache optimizations are still at an early stage. We believe the primary reason for this imbalance is the lack of a suitable representation of a program's dynamic data reference behavior and a quantitative basis for understanding this behavior. We address these issues by proposing a quantitative basis for understanding and optimizing reference locality, and by describing efficient data reference representations and an exploitable locality abstraction that support this framework. Our data reference representations (Whole Program Streams and Stream Flow Graphs) are compact---two to four orders of magnitude smaller than the program's data reference trace---and permit efficient analysis---on the order of seconds to a few minutes---even for complex applications. These representations can be used to efficiently compute our exploitable locality abstraction (hot data streams). We demonstrate that these representations and our hot data stream abstraction are useful for quantifying and exploiting data reference locality. We applied our framework to several SPECint 2000 benchmarks, a graphics program, and a commercial Microsoft database application. The results suggest significant opportunity for hot data stream-based locality optimizations. 1.
Data Flow Analysis for Software Prefetching Linked Data Structures in Java
"... In this paper, we describe an effective compile-time analysis for software prefetching in Java. Previous work in software data prefetching for pointer-based codes uses simple compiler algorithms and does not investigate prefetching for object-oriented language features that make compiletime analysis ..."
Abstract
-
Cited by 94 (7 self)
- Add to MetaCart
(Show Context)
In this paper, we describe an effective compile-time analysis for software prefetching in Java. Previous work in software data prefetching for pointer-based codes uses simple compiler algorithms and does not investigate prefetching for object-oriented language features that make compiletime analysis difficult. We develop a new data flow analysis to detect regular accesses to linked data structures in Java programs. We use intra and interprocedural analysis to identify profitable prefetching opportunities for greedy and jump-pointer prefetching, and we implement these techniques in a compiler for Java. Our results show that both prefetching techniques improve four of our ten programs. The largest performance improvement is 48 % with jump-pointers, but consistent improvements are difficult to obtain.
Reconsidering custom memory allocation
- In Proceedings of the Conference on Object-Oriented Programming: Systems, Languages, and Applications (OOPSLA) 2002
, 2002
"... Programmers hoping to achieve performance improvements often use custom memory allocators. This in-depth study examines eight applications that use custom allocators. Surprisingly, for six of these applications, a state-of-the-art general-purpose allocator (the Lea allocator) performs as well as or ..."
Abstract
-
Cited by 94 (14 self)
- Add to MetaCart
(Show Context)
Programmers hoping to achieve performance improvements often use custom memory allocators. This in-depth study examines eight applications that use custom allocators. Surprisingly, for six of these applications, a state-of-the-art general-purpose allocator (the Lea allocator) performs as well as or better than the custom allocators. The two exceptions use regions, which deliver higher performance (improvements of up to 44%). Regions also reduce programmer burden and eliminate a source of memory leaks. However, we show that the inability of programmers to free individual objects within regions can lead to a substantial increase in memory consumption. Worse, this limitation precludes the use of regions for common programming idioms, reducing their usefulness. We present a generalization of general-purpose and region-based allocators that we call reaps. Reaps are a combination of regions and heaps, providing a full range of region semantics with the addition of individual object deletion. We show that our implementation of reaps provides high performance, outperforming other allocators with region-like semantics. We then use a case study to demonstrate the space advantages and software engineering benefits of reaps in practice. Our results indicate that programmers needing fast regions should use reaps, and that most programmers considering custom allocators should instead use the Lea allocator.
Automatic pool allocation: improving performance by controlling data structure layout in the heap
- In Proceedings of PLDI
, 2005
"... This paper describes Automatic Pool Allocation, a transformation framework that segregates distinct instances of heap-based data structures into seperate memory pools and allows heuristics to be used to partially control the internal layout of those data structures. The primary goal of this work is ..."
Abstract
-
Cited by 82 (9 self)
- Add to MetaCart
(Show Context)
This paper describes Automatic Pool Allocation, a transformation framework that segregates distinct instances of heap-based data structures into seperate memory pools and allows heuristics to be used to partially control the internal layout of those data structures. The primary goal of this work is performance improvement, not automatic memory management, and the paper makes several new contributions. The key contribution is a new compiler algorithm for partitioning heap objects in imperative programs based on a context-sensitive pointer analysis, including a novel strategy for correct handling of indirect (and potentially unsafe) function calls. The transformation does not require type safe programs and works for the full generality of C and C++. Second, the paper describes several optimizations that exploit data structure partitioning to fur-ther improve program performance. Third, the paper evaluates how memory hierarchy behavior and overall program performance are impacted by the new transformations. Using a number of bench-marks and a few applications, we find that compilation times are extremely low, and overall running times for heap intensive pro-grams speed up by 10-25 % in many cases, about 2x in two cases, and more than 10x in two small benchmarks. Overall, we believe this work provides a new framework for optimizing pointer inten-sive programs by segregating and controlling the layout of heap-based data structures.
Characterizing the Memory Behavior of Java Workloads: A Structured View and Opportunities for Optimizations
, 2000
"... This paper studies the memory behavior of important Java workloads used in benchmarking Java Virtual Machines (JVMs), based on instrumentation of both application and library code in a state-of-theart JVM, and provides structured information about these workloads to help guide systems' design. ..."
Abstract
-
Cited by 63 (4 self)
- Add to MetaCart
This paper studies the memory behavior of important Java workloads used in benchmarking Java Virtual Machines (JVMs), based on instrumentation of both application and library code in a state-of-theart JVM, and provides structured information about these workloads to help guide systems' design. We begin by characterizing the inherent memory behavior of the benchmarks, such as information on the breakup of heap accesses among different categories and on the hotness of references to fields and methods. We then provide detailed information about misses in the data TLB and caches, including the distribution of misses over different kinds of accesses and over different methods. In the process, we make interesting discoveries about TLB behavior and limitations of data prefetching schemes discussed in the literature in dealing with pointer-intensive Java codes. Throughout this paper, we develop a set of recommendations to computer architects and compiler writers on how to optimize computer systems and system software to run Java programs more efficiently. This paper also makes the first attempt to compare the characteristics of SPECjvm98 to those of a server-oriented benchmark, pBOB, and explain why the current set of SPECjvm98 benchmarks may not be adequate for a comprehensive and objective evaluation of JVMs and just-in-time (JIT) compilers. We discover that the fraction of accesses to array elements is quite significant, demonstrate that the number of "hot spots" in the benchmarks is small, and show that field reordering cannot yield significant performance gains. We also show that even a fairly large L2 data cache is not effective for many Java benchmarks. We observe that instructions used to prefetch data into the L2 data cache are often squashed because of high TLB miss ...