Results 1-10 of 120
The Landscape of Parallel Computing Research: A View from Berkeley
 TECHNICAL REPORT, UC BERKELEY
, 2006
OSKI: A library of automatically tuned sparse matrix kernels
 Institute of Physics Publishing
, 2005
Parallel Spectral Clustering
Cited by 38 (2 self)
Abstract. Spectral clustering algorithms have been shown to be more effective in finding clusters than most traditional algorithms. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the dataset is large. To perform clustering on large datasets, we propose to parallelize both memory use and computation on distributed computers. Through an empirical study on a large document dataset of 193,844 data instances and a large photo dataset of 637,137, we demonstrate that our parallel algorithm can effectively alleviate the scalability problem. Key words: Parallel spectral clustering, distributed computing
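To make the underlying technique concrete, here is a minimal serial sketch of two-way spectral clustering. It is not the parallel algorithm from the paper: the Gaussian similarity, the unnormalized Laplacian, the power iteration, and the name `fiedler_bisection` are all assumptions chosen so that the example needs only the standard library.

```python
import math


def fiedler_bisection(points, sigma=1.0, iters=200):
    """Split 1-D points into two clusters by the sign of the Fiedler vector.

    Hypothetical helper for illustration only; a toy serial version,
    not the distributed method described in the abstract.
    """
    n = len(points)
    # Dense Gaussian similarity matrix with a zero diagonal.
    W = [[0.0 if i == j
          else math.exp(-(points[i] - points[j]) ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in W]
    # Gershgorin bound: every eigenvalue of L = D - W is at most 2 * max degree.
    c = 2.0 * max(deg)
    v = [float(i) for i in range(n)]
    for _ in range(iters):
        # Remove the constant component (eigenvector of L with eigenvalue 0).
        mean = sum(v) / n
        v = [x - mean for x in v]
        # Power-iteration step with the shifted matrix c*I - L = c*I - D + W.
        w = [(c - deg[i]) * v[i] + sum(W[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The sign pattern of the Fiedler vector gives the two clusters.
    return [1 if x > 0 else 0 for x in v]


labels = fiedler_bisection([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
```

The dense n-by-n matrix `W` is what makes memory grow quadratically with the dataset size, which is the scalability problem the abstract refers to.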
AUTOMATING THE FINITE ELEMENT METHOD
, 2006
Cited by 35 (10 self)
The finite element method can be viewed as a machine that automates the discretization of differential equations, taking as input a variational problem, a finite element and a mesh, and producing as output a system of discrete equations. However, the generality of the framework provided by the finite element method is seldom reflected in implementations (realizations), which are often specialized and can handle only a small set of variational problems and finite elements (but are typically parametrized over the choice of mesh). This paper reviews ongoing research in the direction of a complete automation of the finite element method. In particular, this work discusses algorithms for the efficient and automatic computation of a system of discrete equations from a given variational problem, finite element and mesh. It is demonstrated that by automatically generating and compiling efficient low-level code, it is possible to parametrize a finite element code over variational problem and finite element in addition to the mesh.
A Compact Discontinuous Galerkin (CDG) Method for Elliptic Problems
 Submitted to SIAM J. for Numerical Analysis
, 2006
Cited by 34 (14 self)
Abstract. We present a compact discontinuous Galerkin (CDG) method for an elliptic model problem. The problem is first cast as a system of first-order equations by introducing the gradient of the primal unknown, or flux, as an additional variable. A standard discontinuous Galerkin (DG) method is then applied to the resulting system of equations. The numerical interelement fluxes are such that the equations for the additional variable can be eliminated at the element level, thus resulting in a global system that involves only the original unknown variable. The proposed method is closely related to the local discontinuous Galerkin (LDG) method [B. Cockburn and C.W. Shu, SIAM J. Numer. Anal., 35 (1998), pp. 2440–2463], but, unlike the LDG method, the sparsity pattern of the CDG method involves only nearest neighbors. Also, unlike the LDG method, the CDG method works without stabilization for an arbitrary orientation of the element interfaces. The computation of the numerical interface fluxes for the CDG method is slightly more involved than for the LDG method, but this additional complication is clearly offset by increased compactness and flexibility.
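The first-order reformulation mentioned in the abstract can be written out as follows. This is a sketch that assumes the Poisson equation as the elliptic model problem; the symbols u, f, σ and Ω are illustrative choices, and the paper's actual model problem may include coefficients or boundary terms not shown here.

```latex
% Assumed model problem (Poisson equation on a domain \Omega):
\[
  -\Delta u = f \quad \text{in } \Omega .
\]
% Introducing the gradient of the primal unknown (the flux) as a new variable
% gives an equivalent system of first-order equations:
\[
  \boldsymbol{\sigma} = \nabla u ,
  \qquad
  -\nabla \cdot \boldsymbol{\sigma} = f
  \quad \text{in } \Omega .
\]
```

Substituting the first equation into the second recovers the original problem, which is why eliminating σ at the element level leaves a system in u alone.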
The design and implementation of the MRRR algorithm
 ACM Trans. Math. Software
, 2004
Cited by 27 (4 self)
In the 1990s, Dhillon and Parlett devised the algorithm of multiple relatively robust representations (MRRR) for computing numerically orthogonal eigenvectors of a symmetric tridiagonal matrix T with O(n^2) cost. While previous publications related to MRRR focused on theoretical aspects of the algorithm, documentation of software issues has been missing. In this article, we discuss the design and implementation of the new MRRR version STEGR that will be included in the next LAPACK release. By giving an algorithmic description of MRRR and identifying governing parameters, we hope to make STEGR more easily accessible and suitable for future performance tuning. Furthermore, this should help users understand design choices and tradeoffs when using the code.
Stochastic Superoptimization
Cited by 26 (4 self)
We formulate the loop-free binary superoptimization task as a stochastic search problem. The competing constraints of transformation correctness and performance improvement are encoded as terms in a cost function, and a Markov Chain Monte Carlo sampler is used to rapidly explore the space of all possible programs to find one that is an optimization of a given target program. Although our method sacrifices completeness, the scope of programs we are able to consider, and the resulting quality of the programs that we produce, far exceed those of existing superoptimizers. Beginning from binaries compiled by llvm -O0 for 64-bit x86, our prototype implementation, STOKE, is able to produce programs which either match or outperform the code produced by gcc -O3, icc -O3, and in some cases, expert handwritten assembly.
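The idea of a cost function with a correctness term and a performance term, explored by a Metropolis-style sampler, can be sketched on a toy problem. Everything below is an assumption made for illustration: the three-opcode accumulator machine, the function names, and the parameter values are hypothetical and unrelated to STOKE's actual x86 search.

```python
import math
import random

OPS = ("addx", "dbl", "nop")  # hypothetical toy instruction set


def run(program, x):
    """Execute a program on an accumulator that starts at the input x."""
    acc = x
    for op in program:
        if op == "addx":
            acc += x
        elif op == "dbl":
            acc *= 2
    return acc


def cost(program, target, tests):
    # Correctness term: total output mismatch against the target on test inputs.
    eq = sum(abs(run(program, x) - run(target, x)) for x in tests)
    # Performance term: number of instructions that do real work.
    perf = sum(op != "nop" for op in program)
    return eq, perf


def superoptimize(target, tests=(1, 2), beta=0.5, steps=2000, seed=0):
    rng = random.Random(seed)
    current = list(target)
    eq, perf = cost(current, target, tests)
    best, best_perf = list(current), perf
    for _ in range(steps):
        # Proposal: overwrite one randomly chosen instruction.
        proposal = list(current)
        proposal[rng.randrange(len(proposal))] = rng.choice(OPS)
        p_eq, p_perf = cost(proposal, target, tests)
        delta = (p_eq + p_perf) - (eq + perf)
        # Metropolis rule: always accept improvements, sometimes accept regressions.
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            current, eq, perf = proposal, p_eq, p_perf
            # Remember the cheapest program that matches the target on all tests.
            if eq == 0 and perf < best_perf:
                best, best_perf = list(current), perf
    return best


target = ["addx", "addx", "addx"]  # computes 4 * x with three instructions
best = superoptimize(target)
```

Accepting some cost-increasing proposals matters here: every two-instruction program for 4 * x is at least two edits away from the starting program, and every path to one passes through a program that is incorrect. Agreement on the test inputs is only evidence of equivalence in general, so a real system would need a separate check before trusting the result.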
Multi-Threading and One-Sided Communication in Parallel LU Factorization
Cited by 24 (0 self)
Dense LU factorization has a high ratio of computation to communication and, as evidenced by the High Performance Linpack (HPL) benchmark, this property makes it scale well on most parallel machines. Nevertheless, the standard algorithm for this problem has nontrivial dependence patterns which limit parallelism, and local computations require large matrices in order to achieve good single-processor performance. We present an alternative programming model for this type of problem, which combines UPC's global address space with lightweight multithreading. We introduce the concept of memory-constrained lookahead, where the amount of concurrency managed by each processor is controlled by the amount of memory available. We implement novel techniques for steering the computation to optimize for high performance and demonstrate the scalability and portability of UPC with teraflop-level performance on some machines, comparing favourably to other state-of-the-art MPI codes.
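For reference, the dependence pattern of the standard algorithm is visible even in a small serial version. The following is a textbook right-looking LU factorization with partial pivoting, written as a sketch; it is not the UPC implementation from the paper, and the function name `lu_partial_pivot` is a hypothetical choice.

```python
def lu_partial_pivot(a):
    """Return (lu, perm) with L and U packed in lu, such that P*A = L*U.

    Row i of P*A is a[perm[i]]; L has an implicit unit diagonal.
    """
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        # Pivot search: needs column k fully updated by all earlier steps.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            # Multiplier for row i, stored in the strictly lower triangle.
            lu[i][k] /= lu[k][k]
            # Trailing update: depends on the pivot row chosen above.
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


a = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
lu, perm = lu_partial_pivot(a)
```

Step k + 1 cannot choose its pivot until step k has updated the next column, which is the kind of dependence that limits parallelism and that lookahead schemes try to work around.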
Implementation of a primal-dual method for SDP on a shared memory parallel architecture
 Computational Optimization and Applications
, 2006
Cited by 23 (0 self)
Primal–dual interior point methods and the HKM method in particular have been implemented in a number of software packages for semidefinite programming. These methods have performed well in practice on small- to medium-sized SDPs. However, primal–dual codes have had some trouble in solving larger problems because of the storage requirements and required computational effort. In this paper we describe a parallel implementation of the primal–dual method on a shared memory system. Computational results are presented, including the solution of some large-scale problems with over 50,000 constraints.