Results 1–10 of 38
Symbolic Bounds Analysis of Pointers, Array Indices, and Accessed Memory Regions
PLDI, 2000
Abstract

Cited by 135 (15 self)
This paper presents a novel framework for the symbolic bounds analysis of pointers, array indices, and accessed memory regions. Our framework formulates each analysis problem as a system of inequality constraints between symbolic bound polynomials. It then reduces the constraint system to a linear program. The solution to the linear program provides symbolic lower and upper bounds for the values of pointer and array index variables and for the regions of memory that each statement and procedure accesses. This approach eliminates fundamental problems associated with applying standard fixed-point approaches to symbolic analysis problems. Experimental results from our implemented compiler show that the analysis can solve several important problems, including static race detection, automatic parallelization, static detection of array bounds violations, elimination of array bounds checks, and reduction of the number of bits used to store computed values.
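To make the idea of symbolic bounds concrete, here is a toy sketch (not the paper's implementation, which uses general bound polynomials and a linear-program reduction): bounds are represented as affine expressions c0 + c1*n in a symbolic size parameter n, and the accessed region of an array in a simple loop is derived from the loop bounds.

```python
# Toy illustration of symbolic bounds as affine expressions c0 + c1*n in a
# size parameter n. This is a special case of the paper's symbolic bound
# polynomials; names and the helper are hypothetical.

class Bound:
    """An affine symbolic bound c0 + c1*n."""
    def __init__(self, c0, c1=0):
        self.c0, self.c1 = c0, c1
    def eval(self, n):
        return self.c0 + self.c1 * n

def accessed_region(loop_lower, loop_upper):
    """For `for i in [lower, upper): a[i] = ...`, the accessed region of
    array a is [lower, upper - 1], expressed as symbolic bounds."""
    lo = loop_lower
    hi = Bound(loop_upper.c0 - 1, loop_upper.c1)
    return lo, hi

# The loop `for i in range(0, n): a[i] = 0` accesses a[0 .. n-1].
lo, hi = accessed_region(Bound(0), Bound(0, 1))
```

Evaluating the bounds at a concrete n (say n = 100) yields the region [0, 99], which is the kind of fact the analysis uses for bounds-check elimination and race detection.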
Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software
SIAM Review, Vol. 46, No. 1, pp. 3–45, 2004
Abstract

Cited by 81 (6 self)
Matrix computations are both fundamental and ubiquitous in computational science and its vast application areas. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms and library software that efficiently utilize and adapt to new architecture features. This article reviews and details some of the recent advances made by applying the paradigm of recursion to dense matrix computations on today’s memory-tiered computer systems. Recursion allows for efficient utilization of a memory hierarchy and generalizes existing fixed blocking by introducing automatic variable blocking that has the potential of matching every level of a deep memory hierarchy. Novel recursive blocked algorithms offer new ways to compute factorizations such as Cholesky and QR and to solve matrix equations. In fact, the whole gamut of existing dense linear algebra factorization is beginning to be reexamined in view of the recursive paradigm. Use of recursion has led to using new hybrid data structures and optimized superscalar kernels. The results we survey include new algorithms and library software implementations for level 3 kernels, matrix factorizations, and the solution of general systems of linear equations and several common matrix equations. The software implementations we survey are robust and show impressive performance on today’s high-performance computing systems.
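The recursive-blocking idea can be sketched in a few lines: split the problem into quadrant subproblems until a block fits a small base case, so the blocking automatically adapts to every level of the hierarchy. A minimal sketch for matrix multiply (pure Python, assuming n is a power of two; not the authors' library code, which targets optimized superscalar kernels):

```python
# Recursive blocked matrix multiply: quadrant splitting until a small base
# case, illustrating automatic variable blocking. Assumes n is a power of 2.

def matmul_recursive(A, B, base=2):
    n = len(A)
    if n <= base:  # base-case kernel: plain triple loop
        C = [[0] * n for _ in range(n)]
        for i in range(n):
            for k in range(n):
                aik = A[i][k]
                for j in range(n):
                    C[i][j] += aik * B[k][j]
        return C
    h = n // 2
    def quad(M, r, c):  # extract an h x h quadrant view (as a copy)
        return [row[c:c + h] for row in M[r:r + h]]
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    C11 = add(matmul_recursive(A11, B11, base), matmul_recursive(A12, B21, base))
    C12 = add(matmul_recursive(A11, B12, base), matmul_recursive(A12, B22, base))
    C21 = add(matmul_recursive(A21, B11, base), matmul_recursive(A22, B21, base))
    C22 = add(matmul_recursive(A21, B12, base), matmul_recursive(A22, B22, base))
    return [c1 + c2 for c1, c2 in zip(C11, C12)] + \
           [c1 + c2 for c1, c2 in zip(C21, C22)]
```

At every recursion level the working set halves in each dimension, so some level of the recursion matches each level of the cache hierarchy without an explicitly tuned block size.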
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs
In Proc. of Euro-Par, 2009
Abstract

Cited by 50 (3 self)
While general-purpose homogeneous multicore architectures are becoming ubiquitous, there are clear indications that, for a number of important applications, a better performance/power ratio can be attained using specialized hardware accelerators. These accelerators require specific SDKs or programming languages that are not always easy to program. Thus, the impact of the new programming paradigms on the programmer’s productivity will determine their success in the high-performance computing arena. In this paper we present GPU Superscalar (GPUSs), an extension of the Star Superscalar programming model that targets the parallelization of applications on platforms consisting of a general-purpose processor connected with multiple graphics processors. GPUSs deals with architecture heterogeneity and separate memory address spaces, while preserving simplicity and portability. Preliminary experimental results for a well-known operation in numerical linear algebra illustrate the correct adaptation of the runtime to a multi-GPU system, attaining notable performance results. Key words: Task-level parallelism, graphics processors, heterogeneous systems, programming models
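The core of the StarSs family of models is task-dataflow execution: tasks declare the data they read and write, the runtime infers dependencies, and a task becomes ready once the tasks writing its inputs have finished. A much-simplified, hypothetical sketch (sequential here; names are illustrative and not the GPUSs API, whose real runtime runs ready tasks in parallel and offloads some to GPUs):

```python
# Simplified task-dataflow runtime in the spirit of StarSs/GPUSs: submit()
# records read/write sets and infers read-after-write dependencies; run()
# executes tasks once their dependencies have completed.

class Runtime:
    def __init__(self):
        self.tasks = []          # list of (fn, dependency-set)
        self.last_writer = {}    # data name -> index of last task writing it

    def submit(self, fn, reads=(), writes=()):
        idx = len(self.tasks)
        deps = {self.last_writer[d] for d in reads if d in self.last_writer}
        self.tasks.append((fn, deps))
        for d in writes:
            self.last_writer[d] = idx

    def run(self):
        """Execute tasks whose dependencies are satisfied; returns the
        execution order (a topological order of the inferred task graph)."""
        done, order = set(), []
        while len(done) < len(self.tasks):
            for i, (fn, deps) in enumerate(self.tasks):
                if i not in done and deps <= done:
                    fn()
                    done.add(i)
                    order.append(i)
        return order

log = []
rt = Runtime()
rt.submit(lambda: log.append("init A"), writes=("A",))
rt.submit(lambda: log.append("init B"), writes=("B",))
rt.submit(lambda: log.append("C = A*B"), reads=("A", "B"), writes=("C",))
order = rt.run()
```

The product task can only run after both initializations, which is exactly the dependency the runtime infers from the declared read/write sets.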
Regularity lemmas and combinatorial algorithms
 In Proc. FOCS
Abstract

Cited by 19 (3 self)
We present new combinatorial algorithms for Boolean matrix multiplication (BMM) and for preprocessing a graph to answer independent set queries. We give the first asymptotic improvements on combinatorial algorithms for dense BMM in many years, improving on the “Four Russians” O(n^3/(w log n)) bound for machine models with word size w. (For a pointer machine, we can set w = log n.) The algorithms utilize notions from Regularity Lemmas for graphs in a novel way. We give two randomized combinatorial algorithms for BMM. The first algorithm is essentially a reduction from BMM to the Triangle Removal Lemma. The best known bounds for the Triangle Removal Lemma only imply an O((n^3 log β)/(βw log n)) time algorithm for BMM, where β = (log* n)^δ for some δ > 0, but improvements on the Triangle Removal Lemma would yield corresponding runtime improvements. The second algorithm applies the Weak Regularity Lemma of Frieze and Kannan along with several information compression ideas, running in O(n^3 (log log n)^2/(log n)^{9/4}) time with probability exponentially close to 1. When w ≥ log n, it can be implemented in O(n^3 (log log n)^2/(w log n)^{7/6}) time. Our results immediately imply improved combinatorial methods for CFG parsing, detecting triangle-freeness, and transitive closure. Using Weak Regularity, we also give an algorithm for answering queries of the form “is S ⊆ V an independent set?” in a graph. Improving on prior work, we show how to randomly preprocess a graph in O(n^{2+ε}) time (for all ε > 0) so that with high probability, all subsequent batches of log n independent set queries can be answered deterministically in O(n^2 (log log n)^2/(log n)^{5/4}) time. When w ≥ log n, w queries can be answered in O(n^2 (log log n)^2/(log n)^{7/6}) time. In addition to its nice applications, this problem is interesting in that it is not known how to do better than O(n^2) using “algebraic” methods.
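The word-size factor w in these bounds comes from packing Boolean rows into machine words so that a single AND/OR processes w entries at once. A minimal sketch of that packing trick (using Python's arbitrary-precision ints as stand-in words; this illustrates only the classical speedup, not the paper's regularity-based algorithms):

```python
# Word-packed Boolean matrix multiplication: each row is a bitmask, and
# C[i] is the OR of rows B[k] for every k with bit k set in A[i], i.e.
# C[i][j] = OR_k (A[i][k] AND B[k][j]).

def bmm_packed(A, B):
    """Boolean product of n x n 0/1 matrices given as lists of row bitmasks
    (bit j of A[i] is entry (i, j)). Returns the row bitmasks of C = A*B."""
    n = len(A)
    C = [0] * n
    for i in range(n):
        a, row = A[i], 0
        while a:
            k = (a & -a).bit_length() - 1  # index of lowest set bit
            row |= B[k]                    # OR in row k of B (w entries at once)
            a &= a - 1                     # clear that bit
        C[i] = row
    return C
```

Each inner OR touches w matrix entries in one word operation, which is the source of the 1/w factor in the combinatorial bounds above.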
Processor Allocation on Cplant: Achieving General Processor Locality Using One-Dimensional Allocation Strategies
In Proc. 4th IEEE International Conference on Cluster Computing, 2002
Abstract

Cited by 17 (5 self)
The Computational Plant or Cplant is a commodity-based supercomputer under development at Sandia National Laboratories. This paper describes resource-allocation strategies to achieve processor locality for parallel jobs in Cplant and other supercomputers. Users of Cplant and other Sandia supercomputers submit parallel jobs to a job queue. When a job is scheduled to run, it is assigned to a set of processors. To obtain maximum throughput, jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs.
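A one-dimensional allocation strategy of the kind the title refers to can be sketched simply: order the processors along a line (for example, by a space-filling curve over the machine's mesh) and give each job the window of free processors with the smallest span. The helper below is illustrative, not Sandia's actual allocator:

```python
# One-dimensional allocation sketch: given the line positions of the free
# processors, choose k of them minimizing the span (max - min), a simple
# proxy for communication cost and bandwidth contention.

def allocate(free, k):
    """Return the k free positions with minimum span, or None if fewer
    than k processors are free."""
    free = sorted(free)
    if k > len(free):
        return None
    best = min(range(len(free) - k + 1),
               key=lambda i: free[i + k - 1] - free[i])
    return free[best:best + k]
```

Because the free positions are sorted, the best choice is always a contiguous window of the sorted list, so a linear scan over windows suffices.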
Scheduling of QR factorization algorithms on SMP and multicore architectures
In PDP ’08: Proceedings of the Sixteenth Euromicro International Conference on Parallel, Distributed and Network-Based Processing, 2008
Abstract

Cited by 15 (9 self)
This paper examines the scalable parallel implementation of QR factorization of a general matrix, targeting SMP and multicore architectures. Two implementations of algorithms-by-blocks are presented. Each implementation views a block of a matrix as the fundamental unit of data, and likewise, operations over these blocks as the primary unit of computation. The first is a conventional blocked algorithm similar to those included in libFLAME and LAPACK but expressed in a way that allows operations in the so-called critical path of execution to be computed as soon as their dependencies are satisfied. The second algorithm captures a higher degree of parallelism with an approach based on Givens rotations while preserving the performance benefits of algorithms based on blocked Householder transformations. We show that the implementation effort is greatly simplified by expressing the algorithms in code with the FLAME/FLASH API, which allows matrices stored by blocks to be viewed and managed as matrices of matrix blocks. The SuperMatrix runtime system utilizes FLASH to assemble and represent matrices but also provides out-of-order scheduling of operations that is transparent to the programmer. Scalability of the solution is demonstrated on a ccNUMA platform with 16 processors.
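The building block behind the Givens-rotation approach is a 2x2 rotation that zeroes one entry of a column; chains of such rotations annihilate subdiagonal entries during QR. A small illustrative sketch (not the FLAME/SuperMatrix code):

```python
# Givens rotation: choose c, s with c^2 + s^2 = 1 so that applying
# [[c, s], [-s, c]] to the pair (a, b) maps it to (r, 0).

import math

def givens(a, b):
    """Return (c, s, r) such that c*a + s*b = r and -s*a + c*b = 0."""
    if b == 0:
        return 1.0, 0.0, a
    r = math.hypot(a, b)
    return a / r, b / r, r
```

Because each rotation touches only two rows, independent rotations commute, which is what exposes the higher degree of parallelism mentioned above compared with a single blocked Householder transformation.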
Network-Oblivious Algorithms
In Proc. of 21st International Parallel and Distributed Processing Symposium, 2007
Abstract

Cited by 14 (5 self)
The design of algorithms that can run unchanged yet efficiently on a variety of machines characterized by different degrees of parallelism and communication capabilities is a highly desirable goal. We propose a framework for network-obliviousness based on a model of computation where the only parameter is the problem’s input size. Algorithms are then evaluated on a model with two parameters, capturing parallelism and granularity of communication. We show that, for a wide class of network-oblivious algorithms, optimality in the latter model implies optimality in a block-variant of the Decomposable BSP model, which effectively describes a wide and significant class of parallel platforms. We illustrate our framework by providing optimal network-oblivious algorithms for a few key problems, and also establish some negative results.
Computation regrouping: Restructuring programs for temporal data cache locality
In Intl. Conf. on Supercomputing, 2002
Abstract

Cited by 13 (1 self)
Data access costs contribute significantly to the execution time of applications with complex data structures. As the latency of memory accesses becomes high relative to processor cycle times, application performance is increasingly limited by memory performance. In some situations it may be reasonable to trade increased computation costs for reduced memory costs. The contributions of this paper are threefold: we provide a detailed analysis of the memory performance of a set of seven memory-intensive benchmarks; we describe Computation Regrouping, a general, source-level approach to improving the overall performance of these applications by improving temporal locality to reduce cache and TLB miss ratios (and thus memory stall times); and we demonstrate significant performance improvements from applying Computation Regrouping to our suite of seven benchmarks.
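The flavor of Computation Regrouping can be shown with a toy example (illustrative only, not a benchmark from the paper): two separate traversals touch every element of a large structure twice, while the regrouped version performs both operations per element in a single traversal, improving temporal locality.

```python
# Before: two passes over the data, each touching every element.
def two_passes(data):
    s = sum(x for x in data)        # pass 1 traverses all of data
    m = max(x for x in data)        # pass 2 traverses it again
    return s, m

# After regrouping: both operations happen while each element is still
# hot in the cache, halving the number of traversals.
def regrouped(data):
    s, m = 0, float("-inf")
    for x in data:
        s += x
        m = max(m, x)
    return s, m
```

The transformation changes only when work is done, not what is computed, so both versions return identical results; the benefit shows up as fewer cache and TLB misses when `data` exceeds cache capacity.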
Design-driven compilation
In Proceedings of the International Conference on Compiler Construction, 2001
Abstract

Cited by 10 (1 self)
This paper introduces design-driven compilation, an approach in which the compiler uses design information to drive its analysis and verify that the program conforms to its design. Although this approach requires the programmer to specify additional design information, it offers a range of benefits, including guaranteed fidelity to the designer's expectations of the code, early and automatic detection of design nonconformance bugs, and support for local analysis, separate compilation, and libraries. It can also simplify the compiler and improve its efficiency. The key to the success of our approach is to combine high-level design specifications with powerful static analysis algorithms that handle the low-level details of verifying the design information.
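As a flavor of checking code against a declared design (a hypothetical example, not the paper's compiler or specification language): the programmer declares which callees a function's design allows, and a static walk over the AST flags calls that do not conform.

```python
# Minimal design-conformance check: flag calls in a function that are not
# in its declared allow-list. The spec format and example are hypothetical.

import ast

ALLOWED_CALLS = {"process": {"parse", "validate"}}  # design specification

SOURCE = """
def process(x):
    y = parse(x)
    validate(y)
    log(y)   # not in the declared design
"""

def nonconforming_calls(source, spec):
    """Return (function, callee) pairs violating the design spec."""
    bad = []
    for fn in ast.walk(ast.parse(source)):
        if isinstance(fn, ast.FunctionDef) and fn.name in spec:
            for node in ast.walk(fn):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    if node.func.id not in spec[fn.name]:
                        bad.append((fn.name, node.func.id))
    return bad
```

The check is purely local to each declared function, which mirrors the abstract's point that design information enables local analysis and separate compilation.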
Restructuring computations for temporal data cache locality
International Journal of Parallel Programming, 2003
Abstract

Cited by 9 (0 self)
... with complex data structures. As the latency of memory accesses becomes high relative to processor cycle times, application performance is increasingly limited by memory performance. In some situations it is useful to trade increased computation costs for reduced memory costs. The contributions of this paper are ...