Person:
UĞURDAĞ, Hasan Fatih

Loading...
Profile Picture

Email Address

Birth Date

WoSScopusGoogle ScholarORCID

Name

Job Title

First Name

Hasan Fatih

Last Name

UĞURDAĞ

Publication Search Results

Now showing 1 - 10 of 60
  • Placeholder
    ArticlePublication
    Defect-aware nanocrossbar logic mapping through matrix canonization using two-dimensional radix sort
    (ACM, 2011-08) Gören, S.; Uğurdağ, Hasan Fatih; Palaz, O.; Electrical & Electronics Engineering; UĞURDAĞ, Hasan Fatih
    Nanocrossbars (i.e., nanowire crossbars) offer extreme logic densities but come with very high defect rates; stuck-open/closed, broken nanowires. Achieving reasonable yield and utilization requires logic mapping that is defect-aware even at the crosspoint level. Such logic mapping works with a defect map per each manufactured chip. The problem can be expressed as matching of two bipartite graphs; one for the logic to be implemented and other for the nanocrossbar. This article shows that the problem becomes a Bipartite SubGraph Isomorphism (BSGI) problem within sub-nanocrossbars free of stuck-closed faults. Our heuristic KNS-2DS is an iterative rough canonizer with approximately O(N2) complexity followed by an O(N3) matching algorithm. Canonization brings a partial or full order to graph nodes. It is normally used for solving the regular Graph Isomorphism (GI) problem, while we apply it to BSGI. KNS stands for K-Neighbor Sort and is used for initializing our main contribution 2-Dimensional-Sort (2DS). 2DS operates on the adjacency matrix of a bipartite graph. Radix-2 2DS solves the problem in the absence of stuck-closed faults. With the addition of Radix-3 and our novel Radix-2.5 sort, we solve problems that also have stuck-closed faults. We offer very short runtimes (due to canonization) compared to previous work and have success on all benchmarks. KNS-2DS is also novel from the perspective of BSGI problem as it is based on canonization but not on a search tree with backtracking.
  • Placeholder
    ArticlePublication
    Fast multiplier generator for FPGAs with LUT based partial product generation and column/row compression
    (Elsevier, 2017) Kakacak, Ahmet; Guzel, Aydın Emre; Cihangir, Ozan; Gören, S.; Uğurdağ, Hasan Fatih; Electrical & Electronics Engineering; UĞURDAĞ, Hasan Fatih; Kakacak, Ahmet; Guzel, Aydın Emre; Cihangir, Ozan
    We present a new parallel integer multiplier generator for FPGAs. It combines (i) a new Generalized Parallel Counter (GPC) grouping algorithm for column compression with (ii) a LUT based partial product generation, is (iii) unique as it automatically generates placement pragmas, (iv) uses a ternary adder as a final adder to exploit FPGA's internal carry-chains, and (v) employs a novel GPC based row compression, which aims to reduce the width of the final adder. We wrote Verilog generators for our method as well as one leading work in the literature. For synthesis, we wrote a script that can do “binary search” for the optimum latency. Our extensive implementation results on Xilinx Virtex-6 FPGAs show that we almost always produce circuits with smaller latency (i.e., timing) and Area-Timing Product (ATP) compared to the state-of-the-art in the literature, by 18% and 12% (on the average), respectively. We also offer smaller latency compared to the HDL * operator by 9% on the average at a cost of 12% larger ATP on the average. We are worse in latency in 6 cases out of 33, in all of which synthesis maps * to DSP slices. We also include area and energy results on Virtex-6 as well as a limited amount of latency, area, and ATP results on Virtex-5 and Altera Stratix III.
  • Placeholder
    ArticlePublication
    Fast two-pick n2n round-robin arbiter circuit
    (IEEE, 2012-06) Uğurdağ, Hasan Fatih; Temizkan, Fatih; Baskirt, O.; Yuce, B.; Electrical & Electronics Engineering; UĞURDAĞ, Hasan Fatih; Temizkan, Fatih
    A regular (one-pick) round-robin arbiter circuit picks one active requester (if any) out of n requesters. A two-pick round-robin arbiter selects up to two requesters. An n2n two-pick round-robin arbiter indicates the picked requests with (at most) two-hot n-bit output. A round-robin arbiter is fair to its requesters and does this by repeatedly moving its highest priority pointer to the position immediately next to the second requester picked. Presented is the circuit architecture and VLSI implementation of a new scalable two-pick round-robin arbiter with low latency, which is compared with previous work based on logic synthesis results.
  • Placeholder
    ArticlePublication
    Efficient combinational circuits for division by small integer constants
    (IEEE, 2016) Uğurdağ, Hasan Fatih; Bayram, A.; Levent, Vecdi Levent; Gören, S.; Electrical & Electronics Engineering; UĞURDAĞ, Hasan Fatih; Levent, Vecdi Levent
    Division of an integer by an integer constant is a widely used operation and hence justifies a customized efficient implementation. There are various versions of this operation. This paper attacks a particular version of this problem, where the divisor is small and the circuit outputs a quotient and remainder. We propose a fast (low-latency) yet area-efficient combinational circuit topology, which we call Binary Tree based Constant Division (BTCD). BTCD uses a collection of small LUTs wired to each other to form a binary tree. The circuit also has bunch of adders, whose latencies are almost hidden as they operate in parallel with the binary tree. We wrote RTL code generators for BTCD and two previous works in the literature, then generated circuits for dividends of up to 128 bits and divisors of 3, 5, 11, and 23. We synthesized the generated RTL designs using a commercial ASIC synthesis tool. BTCD strikes a good balance between timing (latency) and area. It is up to 3.3 times better in Area-Timing Product (ATP) compared to the best alternative. ATP has a good correlation with energy consumption.
  • Placeholder
    ArticlePublication
    RImCom: raster-order image compressor for embedded video applications
    (Springer International Publishing, 2017) Palaz, Okan; Uğurdağ, Hasan Fatih; Ozkurt, O.; Kertmen, B.; Donmez, F.; Electrical & Electronics Engineering; UĞURDAĞ, Hasan Fatih; Palaz, Okan
    This paper presents a real-time, rate controlled, end-to-end (encoder and decoder) hardware solution for memory compression of raster-order video streams—named RImCom (short for Raster-order Image Compression). RImCom offers up to 3x compression that is either lossless or lossy at very reasonable PSNR values. The 180 nm ASIC implementation of RImCom achieves 28 fps at Ultra-HD resolution in the slow corner of synthesis. RImCom can match the fps of the state-of-the-art in the literature with 20 % less area or can achieve twice the fps with 55 % more area. Our FPGA implementation is the only end-to-end FPGA solution in the literature to achieve to this day over 60 fps at Full-HD resolution and to offer rate control. This work was motivated by video processing applications that require the previous frame(s) besides the current frame. When processing HD video streams, even when only one previous frame is required besides the current frame, a significant size and bandwidth of memory is needed. If the current frame is compressed on-the-fly with RImCom or a similar solution and stored on DRAM, and the previous frame is read from DRAM and decompressed with a small IP block, then the overall system cost, power consumption, and electromagnetic radiation are reduced.
  • Placeholder
    ArticlePublication
    Fast parallel prefix logic circuits for n2n round-robin arbitration
    (Elsevier, 2012-08) Uğurdağ, Hasan Fatih; Baskirt, O.; Electrical & Electronics Engineering; UĞURDAĞ, Hasan Fatih
    An n2n round-robin arbiter (RRA) searches its n inputs for a 1, starting from the highest-priority input. It picks the first 1 and outputs its index in one-hot encoding. RRA aims to be fair to its inputs and maintains fairness by simply rotating the input priorities, i.e., the last arbitrated input becomes the lowest-priority input. Arbiters are used to multiplex the usage of shared resources among requestors as well as in dispatch logic where the purpose is load balancing among multiple resources. Today, arbiters have hundreds of ports and usually need to run at very high clock speeds. This article presents a new gate-level RRA circuit called Thermo Coded-Parallel Prefix Arbiter (TC-PPA) that scales to any number of requestors. It uses parallel prefix network topologies (borrowed from fast carry lookahead adders) to generate a thermometer-coded pointer, thus greatly reducing critical path. Code generators were written not only for TC-PPA but also for the 5 highly competitive circuits in the literature (9 including their variants), and a rich set of timing/area results were obtained using a standard-cell based logic synthesis flow with a novel iterative strategy based on binary search. Synthesis runs include results with wire-load and without. Results show that for 54 or more ports (except 256) TC-PPA offers the best timing (lowest latency) as well as competitive area. Contributions also include transaction-level simulations that show when pipelining is used to boost clock rate, latency and input FIFO sizes are adversely affected, and hence pipelining cannot be indiscriminately exploited to trim clock period.
  • ArticlePublicationOpen Access
    HC-FFT: Highly configurable and efficient FFT implementation on FPGA
    (TÜBİTAK, 2021) Ergül, Pakize; Uğurdağ, Hasan Fatih; Davutoğlu, D.; Electrical & Electronics Engineering; UĞURDAĞ, Hasan Fatih; Ergül, Pakize
    FFT is one of the basic building blocks in many applications such as sensors, radars, communications. For some applications, e.g., real-time spectral monitoring and analysis, FFT needs to be”run-time configurable” so that the system is real-time. When examining the previous work on configurable real-time (FPGA-based) FFT implementations, we see that the degree of configurability is less than what is desired. In this paper, a new FFT architecture is proposed, which has a high degree of run-time configurability and yet does not compromise area or throughput. The configurable parameters of this design are the number of FFT points (up to 64K), forward versus inverse mode, output order (natural or bit-reversed), and the number of streams (up to 4). The proposed FFT architecture (HC-FFT) is designed using a parallel and pipelined radix-2 multipath delay commutator (MDC) FFT structure. HC-FFT was implemented on a Xilinx Kintex Ultrascale FPGA and was verified against the Xilinx FFT IP. Besides its high degree of run-time configurability, HC-FFT is quite efficient and offers a very high throughput of 87 Gbps with a quite reasonable area.
  • Placeholder
    ArticlePublication
    Fast and efficient circuit topologies for finding the maximum of n k-bit numbers
    (IEEE, 2014-08-01) Yüce, B.; Uğurdağ, Hasan Fatih; Gören, S.; Dündar, G.; Electrical & Electronics Engineering; UĞURDAĞ, Hasan Fatih
    Finding the value and/or index of the maximum (or minimum) element of a set of n numbers (each with k-bits) is a fundamental arithmetic operation and is needed in many applications. This paper proposes several maximum-finder (or minimum-finder) circuit topologies, which are parallel. We wrote circuit generators at hardware description language level for our topologies and previous works. Then we synthesized these circuits for 20 different (n, k) cases (with values up to 64) and compared their efficiency in timing (latency), area, and energy. The timing complexity of our fastest topology is O(log n + log k), whereas the fastest in the literature is O(log n log k). The synthesis results showed that our fastest topology is 1.2-2.2 times (1.6 times on the average) faster than the state-of-the-art. In this paper, we argue that a more fair metric of area efficiency is area-timing product. In terms of ATP, our proposed topologies are better than the state-of-the-art in 19 out of the 20 cases. In terms of energy (i.e., power-timing product, abbreviated as PTP), we are better in 11 cases out of 20.
  • Placeholder
    ArticlePublication
    An efficient algorithm for disparity map compression based on spatial correlations and its low-cost hardware architecture
    (Elsevier, 2023-11) Ghanim, Mustafa; Tasdizen, O.; Uğurdağ, Hasan Fatih; Hamzaoğlu, İlker; Electrical & Electronics Engineering; Computer Science; UĞURDAĞ, Hasan Fatih; HAMZAOĞLU, Ilker; Ghanim, Mustafa
    This paper proposes a low-cost disparity map compression algorithm and its hardware architecture for high resolution and high frame rate applications. The proposed algorithm uses spatial correlations between neighboring disparities and has two variants. The first variant encodes the current disparity using its left neighbor, whereas the second variant benefits from left and upper neighbors. The proposed algorithm obtains 48% and 56% savings in memory space on the average for its variants. The proposed architecture can support 1080p resolution at 60 fps on a low-cost FPGA device while consuming very low area and power.
  • Placeholder
    Conference ObjectPublication
    An area efficient real time implementation of dual tree complex wavelet transform in field programmable gate arrays
    (IEEE, 2015) Canbay, F.; Levent, Vecdi Emre; Serbes, G.; Uğurdağ, Hasan Fatih; Goren, S.; Aydin, N.; Electrical & Electronics Engineering; UĞURDAĞ, Hasan Fatih; Levent, Vecdi Emre
    Biomedical signals (BSs), which give information about the normal condition and also the inherent irregularities of our body, are expected to have non-stationary character due to the time-varying behavior of physiological systems. The Fourier transform and the short time Fourier transform are the widely used frequency and time-frequency analysis methods for extracting information from BSs with fixed frequency and time-frequency resolution respectively. However, in order to derive relevant information from non-stationary BSs, an appropriate analysis method which exhibits adjustable time-frequency resolution is needed. The wavelet transform (WT) can be used as a mathematical microscope in which the time-frequency resolution can be adjusted according to the different parts of the signal. The discrete wavelet transform (DWT) is a fast and discretized implementation for classical WT. Due to the aliasing, lack of directionality and shift-variance disadvantages, the DWT exhibits limited performance in the process of BSs. In literature, an improved version of the DWT, which is named as Dual Tree Complex Wavelet Transform (DTCWT), is employed in the analysis of BSs with great success. In this study, considering the improvements in embedded system technology and the needs for wavelet based real-time feature extraction or de-noising systems in portable medical devices, the DTCWT is implemented as a sub-system in field programmable gate arrays. In proposed hardware architecture, for every data input-channel, DTCWT is implemented by using only one adder and one multiplier. Additionally, considering the multi-channel outputs of biomedical data acquisition systems, this architecture is designed with the capability of running in parallel for N channels.