Results 1 - 10 of 40
The effect of LUT and cluster size on deep-submicron FPGA performance and density
- in Proc. IEEE Field Programmable Gate Arrays (FPGA)
, 2000
"... Abstract—In this paper, we revisit the field-programmable gatearray (FPGA) architectural issue of the effect of logic block functionality on FPGA performance and density. In particular, in the context of lookup table, cluster-based island-style FPGAs (Betz et al. 1997) we look at the effect of looku ..."
Abstract
-
Cited by 108 (4 self)
- Add to MetaCart
Abstract—In this paper, we revisit the field-programmable gate-array (FPGA) architectural issue of the effect of logic block functionality on FPGA performance and density. In particular, in the context of lookup-table, cluster-based island-style FPGAs (Betz et al. 1997), we look at the effect of lookup table (LUT) size and cluster size (number of LUTs per cluster) on the speed and logic density of an FPGA. We use a fully timing-driven experimental flow (Betz et al. 1997), (Marquardt, 1999) in which a set of benchmark circuits are synthesized into different cluster-based (Betz and Rose, 1997, 1998) and (Marquardt, 1999) logic block architectures, which contain groups of LUTs and flip-flops. Across all architectures with LUT sizes in the range of 2 to 7 inputs, and cluster sizes from 1 to 10 LUTs, we have experimentally determined the number of inputs required for a cluster as a function of the LUT size (K) and cluster size (N). Second, contrary to previous results, we have shown that clustering small LUTs (sizes 2 and 3) produces better area results than was reported in the past. However, our results also show that the performance of FPGAs with these small LUT sizes is significantly worse (by almost a factor of 2) than with larger LUTs. Hence, as measured by area-delay product or by performance, these would be a bad choice. Also, we have discovered that LUT sizes of 5 and 6 produce much better area results than were previously believed. Finally, our results show that a LUT size of 4 to 6 and a cluster size of between 3 and 10 provides the best area-delay product for an FPGA. Index Terms—Architecture, clusters, computer-aided design (CAD), field-programmable gate-array (FPGA), look-up table (LUT), very large scale integration (VLSI).
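The cluster-input relationship this paper is commonly cited for is I = (K/2)(N+1): the number of distinct inputs a cluster of N K-input LUTs needs. The Python sketch below illustrates the style of sweep the abstract describes, enumerating (K, N) pairs and keeping the one with the best area-delay product; the area and delay lambdas are hypothetical placeholders, not the paper's measurements.

    def cluster_inputs(k: int, n: int) -> float:
        """Inputs per cluster for K-input LUTs, N LUTs per cluster
        (the I = (K/2)(N+1) rule of thumb attributed to this paper)."""
        return (k / 2) * (n + 1)

    def best_area_delay(area, delay, k_range=range(2, 8), n_range=range(1, 11)):
        """Return the (K, N) pair minimizing area * delay; `area` and `delay`
        are callables (k, n) -> float supplied by an experimental flow."""
        return min(((k, n) for k in k_range for n in n_range),
                   key=lambda kn: area(*kn) * delay(*kn))

    if __name__ == "__main__":
        # Purely illustrative cost models, NOT the paper's data.
        area = lambda k, n: (2 ** k) * n + 10 * cluster_inputs(k, n)
        delay = lambda k, n: 1.0 / k + 0.05 * n
        print(best_area_delay(area, delay))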
Fast Timing-driven Partitioning-based Placement for Island Style FPGAs
- in Proceedings of the ACM/IEEE Design Automation Conference
, 2003
"... In this paper we propose a partitioning-based placement algorithm for FPGAs. The method incorporates simple, but effective heuristics that target delay minimization. The placement engine incorporates delay estimations obtained from previously placed and routed circuits using VPR [6]. As a result, th ..."
Abstract
-
Cited by 20 (3 self)
- Add to MetaCart
(Show Context)
In this paper we propose a partitioning-based placement algorithm for FPGAs. The method incorporates simple but effective heuristics that target delay minimization. The placement engine uses delay estimates obtained from previously placed and routed circuits using VPR [6]. As a result, the delay predictions during placement more closely resemble those observed after detailed routing, which in turn leads to better delay optimization. An efficient terminal alignment heuristic for delay minimization is employed to further optimize the delay of the circuit in the routing phase. Simulation results show that the proposed technique achieves circuit delays (after routing) comparable to those obtained with VPR while achieving a 7-fold speedup in placement runtime.
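As a rough illustration of the approach, here is a minimal Python sketch of recursive-bisection placement paired with a pre-characterized delay model; the function names, the distance-to-delay coefficients, and the even split are illustrative assumptions, not the paper's algorithm.

    from typing import Dict, List, Tuple

    def estimated_delay(src: Tuple[int, int], dst: Tuple[int, int]) -> float:
        """Hypothetical pre-characterized model: Manhattan distance -> routed
        delay in ns, standing in for estimates mined from VPR results."""
        dist = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
        return 0.3 + 0.12 * dist  # illustrative coefficients

    def place_by_bisection(cells: List[str],
                           region: Tuple[int, int, int, int],
                           pos: Dict[str, Tuple[int, int]] = None):
        """Assign each cell a slot by recursively bisecting `region`.

        This sketch splits the cell list evenly and cuts the wider dimension;
        a timing-driven partitioner would instead order cells so connections
        with large `estimated_delay` stay inside one half."""
        if pos is None:
            pos = {}
        if not cells:
            return pos
        x0, y0, x1, y1 = region
        if len(cells) == 1 or (x0 == x1 and y0 == y1):
            for c in cells:
                pos[c] = (x0, y0)
            return pos
        mid = len(cells) // 2
        if x1 - x0 >= y1 - y0:  # vertical cut
            xm = (x0 + x1) // 2
            place_by_bisection(cells[:mid], (x0, y0, xm, y1), pos)
            place_by_bisection(cells[mid:], (xm + 1, y0, x1, y1), pos)
        else:                   # horizontal cut
            ym = (y0 + y1) // 2
            place_by_bisection(cells[:mid], (x0, y0, x1, ym), pos)
            place_by_bisection(cells[mid:], (x0, ym + 1, x1, y1), pos)
        return pos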
GraphStep: A System Architecture for Sparse-Graph Algorithms
- in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines
, 2006
"... Abstract — Many important applications are organized around long-lived, irregular sparse graphs (e.g., data and knowledge bases, CAD optimization, numerical problems, simulations). The graph structures are large, and the applications need regular access to a large, data-dependent portion of the grap ..."
Abstract
-
Cited by 17 (7 self)
- Add to MetaCart
(Show Context)
Abstract—Many important applications are organized around long-lived, irregular sparse graphs (e.g., data and knowledge bases, CAD optimization, numerical problems, simulations). The graph structures are large, and the applications need regular access to a large, data-dependent portion of the graph for each operation (e.g., the algorithm may need to walk the graph, visiting all nodes, or propagate changes through many nodes in the graph). On conventional microprocessors, the graph structures exceed on-chip cache capacities, making main-memory bandwidth and latency the key performance limiters. To avoid this “memory wall,” we introduce a concurrent system architecture for sparse graph algorithms that places graph nodes in small distributed memories paired with specialized graph processing nodes interconnected by a lightweight network. This gives us a scalable way to map these applications so that they can exploit the high-bandwidth and low-latency capabilities of embedded memories (e.g., FPGA Block RAMs). On typical spreading-activation queries on the ConceptNet Knowledge Base, a sample application, this translates into an order-of-magnitude speedup per FPGA compared to a state-of-the-art Pentium processor.
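A software model of one synchronous "graph step" makes the access pattern concrete: every active node sends messages along its edges, and messages accumulate at the destination node's local memory (an FPGA Block RAM in the hardware mapping). The Python sketch below models one spreading-activation step; the decay factor and the tiny graph are invented for illustration.

    from collections import defaultdict
    from typing import Dict, List

    def graph_step(edges: Dict[str, List[str]],
                   activation: Dict[str, float],
                   decay: float = 0.5) -> Dict[str, float]:
        """One synchronous step: each active node spreads a decayed share of
        its activation along its out-edges; shares are accumulated at the
        destination (the node-local memory in the hardware mapping)."""
        inbox = defaultdict(float)
        for node, level in activation.items():
            fanout = edges.get(node, [])
            if not fanout or level == 0.0:
                continue
            share = decay * level / len(fanout)
            for dst in fanout:
                inbox[dst] += share
        # New state = accumulated messages (a real query would retain state).
        return dict(inbox)

    if __name__ == "__main__":
        g = {"dog": ["pet", "animal"], "pet": ["animal"], "animal": []}
        print(graph_step(g, {"dog": 1.0}))  # {'pet': 0.25, 'animal': 0.25}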
Stream computations organized for reconfigurable execution
, 2006
"... Reconfigurable systems can offer the high spatial parallelism and fine-grained, bit-level resource control traditionally associated with hardware implementations, along with the flexibility and adaptability characteristic of software. While reconfigurable systems create new opportunities for enginee ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
Reconfigurable systems can offer the high spatial parallelism and fine-grained, bit-level resource control traditionally associated with hardware implementations, along with the flexibility and adaptability characteristic of software. While reconfigurable systems create new opportunities for engineering and delivering high-performance programmable systems, the traditional approaches to programming and managing computations used for hardware systems (e.g., Verilog, VHDL) and software systems (e.g., C, Fortran, Java) are inappropriate and inadequate for exploiting reconfigurable platforms. To address this need, we develop a stream-oriented compute model, system architecture, and execution patterns that capture and exploit the parallelism of spatial computations while abstracting software applications from hardware details (e.g., timing, device capacity, and microarchitecture), consequently allowing applications to scale to exploit newer, larger, and faster hardware platforms. Further, we describe hardware and software techniques that make this late-bound platform mapping viable and efficient.
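To make the compute model concrete, a toy Python sketch of a stream operator follows: operators exchange data only through queues, so the dataflow graph carries no timing or capacity assumptions and can be re-mapped to different platforms. All names here are illustrative, not the dissertation's API.

    from queue import Queue

    class Operator:
        """A compute node that communicates only through streams (queues),
        so the surrounding graph makes no timing or capacity assumptions."""
        def __init__(self, fn, out: Queue = None):
            self.fn, self.out = fn, out

        def fire(self, token):
            """Consume one input token, emit one result downstream."""
            result = self.fn(token)
            if self.out is not None:
                self.out.put(result)

    if __name__ == "__main__":
        q = Queue()
        square = Operator(lambda x: x * x, out=q)
        for x in range(4):       # the input stream
            square.fire(x)
        while not q.empty():
            print(q.get())       # 0, 1, 4, 9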
SPICE2 – A Spatial Parallel Architecture for Accelerating the SPICE Circuit Simulator
, 2010
"... Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8 × speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially impl ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
(Show Context)
Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order-of-magnitude speedup (mean 2.8× speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control, and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase and sparse dataflow parallelism in the Sparse Matrix-Solve phase, and compose the complete design in streaming fashion. We name our parallel architecture SPICE2: Spatial Processors Interconnected for Concurrent Execution. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes, and exploits the parallelism available in the SPICE circuit simulator. Our design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention.
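A minimal Python sketch of the three-phase loop named in the abstract, with numpy's dense solver standing in for the sparse dataflow Matrix-Solve; the two-resistor circuit and all constants are invented for illustration.

    import numpy as np

    def model_evaluation(devices, n):
        """Data-parallel phase: each device independently stamps its
        conductance into G. `devices` are (node_i, node_j, conductance)
        resistors; node n-1 is treated as ground."""
        G, I = np.zeros((n, n)), np.zeros(n)
        for i, j, g in devices:
            G[i, i] += g; G[j, j] += g
            G[i, j] -= g; G[j, i] -= g
        I[0] = 1e-3  # hypothetical 1 mA current source into node 0
        return G[:-1, :-1], I[:-1]  # drop the ground row/column

    def sparse_matrix_solve(G, I):
        """Solve G v = I (the paper uses sparse dataflow LU; dense here)."""
        return np.linalg.solve(G, I)

    def iteration_control(v_new, v_old, tol=1e-9):
        """Decide whether the Newton iteration has converged."""
        return v_old is not None and np.max(np.abs(v_new - v_old)) < tol

    if __name__ == "__main__":
        devices = [(0, 1, 1e-3), (1, 2, 2e-3)]  # two resistors; node 2 = ground
        v = None
        for _ in range(10):  # outer Newton loop (linear circuit: converges fast)
            G, I = model_evaluation(devices, 3)
            v_new = sparse_matrix_solve(G, I)
            if iteration_control(v_new, v):
                break
            v = v_new
        print(v)  # node voltages [1.5, 0.5]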
Efficient and Deterministic Parallel Placement for FPGAs
"... We describe a parallel simulated annealing algorithm for FPGA placement. The algorithm proposes and evaluates multiple moves in parallel, and has been incorporated into Altera’s Quartus II CAD system. Across a set of 18 industrial benchmark circuits, we achieve geometric average speedups during the ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
We describe a parallel simulated annealing algorithm for FPGA placement. The algorithm proposes and evaluates multiple moves in parallel, and has been incorporated into Altera’s Quartus II CAD system. Across a set of 18 industrial benchmark circuits, we achieve geometric average speedups during the quench of 2.7x and 4.0x on four and eight processors, respectively, with individual circuits achieving speedups of up to 3.6x and 5.9x. Over the course of the entire anneal, we achieve speedups of up to 2.8x and 3.7x, with geometric average speedups of 2.1x and 2.4x. Our algorithm is the first parallel placer to optimize for criteria other than wirelength, such as critical path length, and is one of the few deterministic parallel placement algorithms. We discuss the challenges involved in combining these two features and the new techniques we used to overcome them. We also quantify the impact of maintaining determinism on eight cores, and find that while it reduces performance by approximately 15% relative to an ideal speedup of 8.0x, hardware limitations are a larger factor and reduce performance by 30–40%. We then suggest possible enhancements to allow our approach to scale to 16 cores and beyond.
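The core idea, proposing and scoring many moves concurrently but committing them in a fixed order so the outcome is independent of thread timing, can be sketched in a few lines of Python; the cost model and conflict rule here are assumptions, not Quartus II's.

    import random
    from concurrent.futures import ThreadPoolExecutor

    def propose_and_score(args):
        """Propose one swap from a per-proposal seed and score it."""
        placement, cost_fn, seed = args
        rng = random.Random(seed)           # fixed seed => reproducible proposal
        a, b = rng.sample(sorted(placement), 2)
        swapped = dict(placement)
        swapped[a], swapped[b] = swapped[b], swapped[a]
        return a, b, cost_fn(swapped) - cost_fn(placement)

    def parallel_step(placement, cost_fn, n_moves=8, base_seed=0, workers=4):
        """Score n_moves swaps in parallel, then commit improving,
        non-conflicting ones in proposal order (deterministic)."""
        jobs = [(placement, cost_fn, base_seed + i) for i in range(n_moves)]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            scored = list(pool.map(propose_and_score, jobs))
        touched = set()
        for a, b, dcost in scored:          # fixed commit order
            if dcost < 0 and a not in touched and b not in touched:
                placement[a], placement[b] = placement[b], placement[a]
                touched.update((a, b))
        return placement

    if __name__ == "__main__":
        slots = [(i % 4, i // 4) for i in range(16)]
        cells = {i: slots[15 - i] for i in range(16)}   # scrambled start
        cost = lambda p: sum(abs(p[i][0] - slots[i][0]) +
                             abs(p[i][1] - slots[i][1]) for i in p)
        for step in range(8):
            parallel_step(cells, cost, base_seed=100 * step)
        print(cost(cells))  # no worse than the starting cost

Scoring all proposals against the same starting placement means some evaluations are stale by commit time; dropping conflicting moves in a fixed order is what keeps the result deterministic regardless of how the threads are scheduled.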
New Timing and Routability Driven Placement Algorithms for FPGA Synthesis
, 2007
"... We present new timing and congestion driven FPGA placement algorithms with minimal runtime overhead. By predicting the post-routing critical edges and estimating congestion accurately, our algorithms simultaneously reduce the critical path delay and the minimum number of routing tracks. The core of ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
We present new timing- and congestion-driven FPGA placement algorithms with minimal runtime overhead. By predicting the post-routing critical edges and estimating congestion accurately, our algorithms simultaneously reduce the critical path delay and the minimum number of routing tracks. The core of our algorithm consists of a criticality-history record of connection edges and a congestion map. This approach is applied to the 20 largest MCNC benchmark circuits. Experimental results show that, compared with VPR [2], our algorithms yield an average 8.1% reduction (maximum 30.5%) in critical path delay and a 5% reduction in channel width. Meanwhile, the average runtime of our algorithms is only 2.3x that of VPR’s.
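A plausible shape for the two data structures the abstract names, a per-edge criticality history and a congestion map, is sketched below in Python; the decay and blending constants are illustrative assumptions rather than the paper's formulation.

    from collections import defaultdict

    class TimingCongestionCost:
        """Move-cost helper combining wirelength, a decaying per-edge
        criticality history, and a congestion-map penalty."""
        def __init__(self, alpha=0.7, beta=0.3, decay=0.8):
            self.alpha, self.beta, self.decay = alpha, beta, decay
            self.history = defaultdict(float)      # edge -> criticality weight
            self.congestion = defaultdict(float)   # tile -> estimated usage

        def update_history(self, edge_criticality):
            """After each timing analysis, decay the record and fold in the
            newest criticalities so recently critical edges stay weighted."""
            for edge, crit in edge_criticality.items():
                self.history[edge] = (self.decay * self.history[edge]
                                      + (1 - self.decay) * crit)

        def edge_cost(self, edge, wirelength, tile):
            """Cost of placing `edge` at this wirelength through `tile`."""
            timing = (1.0 + self.history[edge]) * wirelength
            return self.alpha * timing + self.beta * self.congestion[tile]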
Deterministic Timing-Driven Parallel Placement by Simulated Annealing using Half-Box Window Decomposition
"... Abstract—As each generation of FPGAs grow in size, the run time of the associated CAD tools is rapidly increasing. Many past efforts have aimed at improving the CAD run time through parallelization of the placement algorithm. Wang and Lemieux presented an algorithm that is scalable, deterministic, t ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Abstract—As each generation of FPGAs grows in size, the run time of the associated CAD tools is rapidly increasing. Many past efforts have aimed at improving CAD run time through parallelization of the placement algorithm. Wang and Lemieux presented an algorithm that is scalable, deterministic, timing-driven, and achieves speedup over VPR [Wang and Lemieux FPGA’11]. This paper provides two significant alterations to Wang and Lemieux’s algorithm, resulting in additional speedup and quality improvement. The first contribution is a new data decomposition scheme, called the half-box window technique, which achieves speedup by reducing the frequency of thread synchronization. The second contribution is an improved annealing schedule, which further improves run time and slightly improves the quality of results. Together, these modifications achieve run-time speedups of up to 70%. To put this in perspective, Wang and Lemieux required 25 threads to achieve their best speedup, while this work requires only 16 threads. For a 10% degradation in quality, the new 16-thread algorithm achieves a 51x speedup over VPR, compared to a 35x speedup for the 25-thread original algorithm. Regarding quality, the best quality of results achieved by the new algorithm is a 5% degradation versus VPR, compared to an 8% degradation for the original Wang and Lemieux algorithm. Index Terms—FPGA; CAD; parallel placement
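A sketch of how half-box window decomposition might partition the placement grid: each thread anneals within one window (so moves never conflict across threads), and alternate passes shift the window origin by half a box so moves can eventually cross old window boundaries. The geometry below is a Python illustration, not the paper's exact scheme.

    def windows(grid_w, grid_h, box, pass_idx):
        """Yield (x0, y0, x1, y1) windows tiling the grid; odd-numbered
        passes shift the tiling by half a box in both dimensions."""
        off = (box // 2) if pass_idx % 2 else 0
        x = -off
        while x < grid_w:
            y = -off
            while y < grid_h:
                yield (max(x, 0), max(y, 0),
                       min(x + box - 1, grid_w - 1), min(y + box - 1, grid_h - 1))
                y += box
            x += box

    if __name__ == "__main__":
        for w in windows(8, 8, 4, pass_idx=1):
            print(w)  # shifted windows; swaps stay inside one window per pass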
Self-hosted placement for massively parallel processor arrays
- University of Toronto
, 2009
"... Abstract—We consider the placement problem as part of the CAD flow for a massively parallel processor arrays (MPPAs). In contrast to traditional placers, which operate on a workstation with one or several cores and are able to take advantage of parallelism to a limited degree, we investigate running ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
Abstract—We consider the placement problem as part of the CAD flow for massively parallel processor arrays (MPPAs). In contrast to traditional placers, which operate on a workstation with one or several cores and can exploit parallelism only to a limited degree, we investigate running the placer on the target architecture itself. As the number of processor elements (PEs) in such a device scales, so too does the computational power available to the placer. This natural scaling helps avoid the long runtimes that afflict FPGA flows. In this paper, we propose a distributed placer suitable to run on an MPPA. This placer takes advantage of the local interconnect fabric and may be efficiently coded on a simple, RISC-like core. We investigate the performance of this placer and compare it to traditional, simulated-annealing-based placers using both unrealistic (but nearly optimal) and realistic (but suboptimal) annealing schedules. On a simulated 32 × 32 = 1024-core MPPA, the proposed algorithm furnishes placements within 5% of the optimal placement quality, a level competitive with the realistic, traditional placer. To do so, the distributed placer requires each PE to consider 1/256th as many swaps as the traditional placer, a computational advantage that scales favourably as the number of cores on the MPPA increases.
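A serial Python simulation of the self-hosted scheme suggests its flavor: each PE holds one cell and negotiates swaps with a random neighbor using only locally computable cost. The `affinity` function and all constants are hypothetical, not the paper's formulation.

    import random

    def neighbors(x, y, w, h):
        """Coordinates of the PEs adjacent to (x, y) on a w-by-h array."""
        return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= x + dx < w and 0 <= y + dy < h]

    def local_swap_pass(cells, affinity, w, h, rng):
        """One pass: each PE proposes a swap with one random neighbor and
        accepts it if it lowers the locally computable cost."""
        for x in range(w):
            for y in range(h):
                nx, ny = rng.choice(neighbors(x, y, w, h))
                before = affinity(cells[(x, y)], x, y) + affinity(cells[(nx, ny)], nx, ny)
                after = affinity(cells[(x, y)], nx, ny) + affinity(cells[(nx, ny)], x, y)
                if after < before:
                    cells[(x, y)], cells[(nx, ny)] = cells[(nx, ny)], cells[(x, y)]

    if __name__ == "__main__":
        rng, w, h = random.Random(0), 4, 4
        cells = {(x, y): rng.randrange(16) for x in range(w) for y in range(h)}
        # Hypothetical affinity: cell k "wants" to sit at (k % w, k // w).
        aff = lambda k, x, y: abs(x - k % w) + abs(y - k // w)
        for _ in range(20):
            local_swap_pass(cells, aff, w, h, rng)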