Results 1 - 2 of 2
Efficient Multiplication of Polynomials on Graphics Hardware
Cited by 4 (1 self)
Abstract. We present an algorithm for efficiently multiplying univariate polynomials with integer coefficients using the number-theoretic transform (NTT) on graphics processing units (GPUs). The same approach can be used to multiply large integers encoded as polynomials. Our algorithm exploits the fused multiply-add capabilities of the graphics hardware. NTT multiplications are executed in parallel for a set of distinct primes, followed by reconstruction using the Chinese remainder theorem (CRT) on the GPU. Our benchmarks show NTT multiplication performance of up to 77 GMul/s. We compared our approach with CPU-based implementations of polynomial and large-integer multiplication provided by the NTL and GMP libraries.
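The scheme the abstract describes (one NTT multiplication per prime, then CRT reconstruction) can be illustrated with a small serial sketch. This is not the paper's GPU implementation: the three primes, the primitive root 3, and the function names `ntt`, `multiply_mod`, `crt_combine`, and `multiply` are assumptions chosen for illustration, and the per-prime loop that a GPU would run in parallel is executed sequentially here.

```python
# Assumed NTT-friendly primes of the form c * 2^k + 1; 3 is a primitive root of each.
PRIMES = [998244353, 469762049, 167772161]
ROOT = 3


def ntt(a, p, invert=False):
    """In-place iterative radix-2 NTT of list a (length a power of two) modulo prime p."""
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes over blocks of doubling length.
    length = 2
    while length <= n:
        w_len = pow(ROOT, (p - 1) // length, p)  # primitive length-th root of unity
        if invert:
            w_len = pow(w_len, p - 2, p)         # modular inverse via Fermat
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u, v = a[k], a[k + half] * w % p
                a[k] = (u + v) % p
                a[k + half] = (u - v) % p
                w = w * w_len % p
        length <<= 1
    if invert:
        n_inv = pow(n, p - 2, p)
        for i in range(n):
            a[i] = a[i] * n_inv % p


def multiply_mod(f, g, p):
    """Coefficients of f * g reduced modulo the prime p."""
    out_len = len(f) + len(g) - 1
    size = 1
    while size < out_len:
        size <<= 1
    fa = [c % p for c in f] + [0] * (size - len(f))
    ga = [c % p for c in g] + [0] * (size - len(g))
    ntt(fa, p)
    ntt(ga, p)
    prod = [x * y % p for x, y in zip(fa, ga)]  # pointwise product
    ntt(prod, p, invert=True)
    return prod[:out_len]


def crt_combine(residues, moduli):
    """Incremental CRT; returns the representative in the symmetric range."""
    x, m = 0, 1
    for r, p in zip(residues, moduli):
        t = (r - x) * pow(m % p, p - 2, p) % p  # solve x + m * t = r (mod p)
        x += m * t
        m *= p
    return x - m if x > m // 2 else x


def multiply(f, g):
    """Multiply integer polynomials given as coefficient lists, lowest degree first."""
    per_prime = [multiply_mod(f, g, p) for p in PRIMES]  # independent per prime
    return [crt_combine(rs, PRIMES) for rs in zip(*per_prime)]
```

For example, `multiply([1, 2, 3], [4, 5])` multiplies 1 + 2x + 3x² by 4 + 5x. The result is exact only while every true coefficient is smaller in absolute value than half the product of the chosen primes; larger coefficients would need more primes.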
Digits of π Calculation
Abstract. We present efficient parallel algorithms for multiple-precision arithmetic operations on numbers of more than several million decimal digits on distributed-memory parallel computers. A parallel implementation of floating-point real FFT-based multiplication is used because a key operation in fast multiple-precision arithmetic is multiplication. We also parallelized the operation of releasing propagated carries and borrows in multiple-precision addition, subtraction, and multiplication. More than 1.6 trillion decimal digits of π were computed on 256 nodes of an Appro Xtreme-X3 (648 nodes, 147.2 GFlops/node, 95.4 TFlops peak performance) in an elapsed time of 137 hours 42 minutes, including the time for verification.
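The two ingredients named in the abstract, floating-point FFT-based multiplication and the release of propagated carries, can be shown in a minimal serial sketch. It makes several assumptions that are not in the paper: it uses a textbook recursive complex FFT rather than a real FFT, it runs on a single process rather than a distributed-memory machine, it handles non-negative integers only, and the limb base `BASE` and all function names are hypothetical.

```python
import cmath

BASE = 1000  # assumed small limb base, keeps floating-point rounding error far below 0.5


def fft(a, invert=False):
    """Recursive radix-2 FFT; the inverse transform is left unscaled."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def to_limbs(x):
    """Split a non-negative integer into base-BASE limbs, least significant first."""
    limbs = []
    while x:
        x, r = divmod(x, BASE)
        limbs.append(r)
    return limbs or [0]


def multiply_ints(x, y):
    """Multiply two non-negative integers via FFT convolution of their limbs."""
    a, b = to_limbs(x), to_limbs(y)
    size = 1
    while size < len(a) + len(b):
        size <<= 1
    fa = fft([complex(v) for v in a] + [0j] * (size - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (size - len(b)))
    conv = fft([p * q for p, q in zip(fa, fb)], invert=True)
    # Scale the inverse transform and round each entry to the nearest integer.
    raw = [round(c.real / size) for c in conv]
    # Release carries: each raw entry may exceed BASE, so push the excess upward.
    limbs, carry = [], 0
    for v in raw:
        carry, digit = divmod(v + carry, BASE)
        limbs.append(digit)
    while carry:
        carry, digit = divmod(carry, BASE)
        limbs.append(digit)
    return sum(d * BASE**i for i, d in enumerate(limbs))
```

The carry loop is inherently sequential in this form, since each limb depends on the carry out of the previous one. That dependency is why the abstract treats carry release as a separate operation to parallelize.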