Publications
ExTensor: An Accelerator for Sparse Tensor Algebra.
IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019.
Application-Transparent Near-Memory Processing Architecture with Memory Channel Network.
IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018.
Chameleon: Versatile and Practical Near-DRAM Acceleration Architecture for Large Memory Systems.
IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.
SpinWise: A Practical Energy-Efficient Synchronization Technique for CMPs.
ACM SIGARCH Computer Architecture News, 44.1 (2016): 1-8.
Near-DRAM Acceleration with Single-ISA Heterogeneous Processing in Standard Memory Modules.
IEEE Micro 36.1 (2016): 24-34.
VR-Scale: Runtime Dynamic Phase Scaling of Processor Voltage Regulators for Improving Power Efficiency.
IEEE/ACM Design Automation Conference (DAC), 2016.
Energy-efficient approximate multiplication for digital signal processing and classification applications.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2015.
Multiplier supporting accuracy and energy trade-offs for recognition applications.
IET Electronics Letters, vol. 50, no. 7, pp. 512-514, 2014.
A novel hardware implementation for joint heart rate, respiration rate, and gait analysis applied to body area networks.
IEEE International Symposium on Circuits and Systems (ISCAS), 2013.
A novel hardware implementation for joint heart rate, respiration rate, and gait analysis applied to body area networks
M. Khazraee, A. R. Zamani, M. Hallajian, S. P. Ehsani, H. A. Moghaddam, A. Parsafar, and M. Shabany
Continuous, remote monitoring of a patient's vital health and physical-activity signs is one of the most important technology-oriented applications for the health care of ill individuals. In this paper, an innovative framework for a wireless Body Area Network (BAN) system, based on the IEEE 802.15.6 standard, with three types of sensors is proposed and implemented: an electrocardiogram (ECG) sensor, a Force Sensitive Resistor (FSR), and a gyroscope. The proposed design is a novel embedded-system implementation for real-time processing and analysis of the ECG signal and gait phases, and for detection of the respiration rate from the ECG signal, using small, wearable sensors and wireless data communication. Gait analysis is essential for precise ECG and respiration analysis according to body posture; a new comprehensive high-speed six-state design is utilized to cover all walking habits. Moreover, the collected data is sent to an external device for further monitoring. The proposed framework is optimized for hardware implementation and targeted at low-power applications. The optimized joint implementation of these health-related sensors distinguishes the proposed design from previous work.
Multiplier supporting accuracy and energy trade-offs for recognition applications
Nam Sung Kim, Taejoon Park, Srinivasan Narayanamoorthy, and Hadi Asghari-Moghaddam
The need to support various recognition applications on energy-constrained mobile computing devices has steadily grown. Exploiting common characteristics of recognition algorithms, a very energy-efficient multiplier that can support a runtime trade-off between computational accuracy and energy consumption is proposed. Compared to a precise multiplier, the proposed multiplier consumes 11.6×-3.2× less energy per multiplication while achieving 82-99% of the computational accuracy with negligible negative impact on recognition accuracy for three different recognition applications.
Energy-efficient approximate multiplication for digital signal processing and classification applications.
Srinivasan Narayanamoorthy, Hadi Asghari-Moghaddam, Zhenhong Liu, Taejoon Park, and Nam Sung Kim
The need to support various digital signal processing (DSP) and classification applications on energy-constrained devices has steadily grown. Such applications often extensively perform matrix multiplications using fixed-point arithmetic while exhibiting tolerance for some computational errors. Hence, improving the energy efficiency of multiplications is critical. In this brief, we propose multiplier architectures that can tradeoff computational accuracy with energy consumption at design time. Compared with a precise multiplier, the proposed multiplier can consume 58% less energy/op with average computational error of ∼1%. Finally, we demonstrate that such a small computational error does not notably impact the quality of DSP and the accuracy of classification applications.
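The accuracy/energy trade-off described above can be illustrated with a toy sketch: zeroing out the low-order bits of each operand before multiplying, so a (hypothetical) hardware multiplier could skip the corresponding partial products. This is only a minimal illustration of the trade-off idea, not the segment-based architecture the paper proposes; the `trunc_bits` parameter and error figures are illustrative assumptions.

```python
def approx_multiply(a: int, b: int, trunc_bits: int = 4) -> int:
    """Toy approximate multiplier: drop the low `trunc_bits` bits of each
    operand before multiplying, trading accuracy for (hardware) energy.
    Illustrative only; not the architecture proposed in the paper."""
    mask = ~((1 << trunc_bits) - 1)  # clear the low-order bits
    return (a & mask) * (b & mask)

exact = 1000 * 1000
approx = approx_multiply(1000, 1000, trunc_bits=4)
rel_error = abs(exact - approx) / exact  # small relative error for large operands
```

For fixed-point DSP inputs that use most of the operand width, the relative error stays small, which is why error-tolerant kernels like matrix multiplication can absorb it.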
VR-Scale: Runtime dynamic phase scaling of processor voltage regulators for improving power efficiency.
Hadi Asghari-Moghaddam, Hamid Reza Ghasemi, Abhishek A. Sinkar, Indrani Paul, and Nam Sung Kim.
A voltage regulator (VR) is one of the most critical platform components for a processor. In particular, a VR must support fast, accurate, and fine-grained voltage changes for efficient processor power management. These requirements can be relaxed, however, when the processor consumes little power at runtime, so manufacturers have begun to offer knobs that let a processor adapt the VR's operating parameters to satisfy the requirements cost-effectively. In this paper, we first demonstrate that: (1) VR efficiency heavily depends on the load current (i.e., the current delivered to the processor) and on a VR operating parameter (e.g., the number of active phases) at a given voltage; (2) a processor running a parallel application mostly draws little current due to aggressive power management; and (3) while the processor is in the active state, all of the VR's phases remain activated. Together, (2) and (3) lead to poor VR efficiency most of the time. Second, we present VR-Scale, which dynamically scales the number of active phases based on the predicted load current for the next interval. Our evaluations on an Intel processor running emerging parallel applications show that VR-Scale can reduce the total power consumed by the processor and its VR by more than 19% with negligible performance impact.
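The core control decision, scaling active phases to the predicted load, can be sketched as a simple policy: activate just enough phases that each one carries a load near its efficient operating current. The per-phase current and phase-count values below are illustrative assumptions, not parameters from the paper.

```python
import math

def active_phases(predicted_load_amps: float,
                  per_phase_efficient_amps: float = 15.0,
                  max_phases: int = 6) -> int:
    """Choose the number of active VR phases for the next interval so each
    active phase runs near its efficient load point. A sketch of the
    phase-scaling idea only; parameter values are hypothetical."""
    if predicted_load_amps <= 0:
        return 1  # keep at least one phase active to sustain the rail
    needed = math.ceil(predicted_load_amps / per_phase_efficient_amps)
    return max(1, min(max_phases, needed))
```

At low predicted current the controller deactivates most phases (avoiding the fixed switching losses of idle phases), and it re-enables them as the predicted load grows.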
Near-DRAM acceleration with single-ISA heterogeneous processing in standard memory modules.
Hadi Asghari-Moghaddam, Amin Farmahini-Farahani, Katherine Morrow, Jung Ho Ahn, and Nam Sung Kim.
Energy consumed for transferring data across the processor memory hierarchy constitutes a large fraction of total system energy consumption, and this fraction has steadily increased with technology scaling. This article presents a near-DRAM acceleration (NDA) architecture wherein lightweight processors (LWPs) with the same ISA as their host processor are 3D-stacked atop commodity DRAM devices in a standard memory module to efficiently process data. In contrast to previous architectures, the authors' NDA architecture requires negligible changes to commodity DRAM device and standard memory module architectures. This allows the NDA to be more easily adopted in both existing and emerging systems. Experiments demonstrate that, on average, the authors' NDA-based system consumes almost 65 percent less energy at nearly two times higher performance than the baseline system.
SpinWise: A Practical Energy-Efficient Synchronization Technique for CMPs.
Hadi Asghari-Moghaddam, and Nam Sung Kim.
Spinning was the classical way of implementing synchronization primitives (i.e., barriers, locks, and condition variables) in the pthread library before the adoption of the fast userspace mutex (futex). Since spinning cores do not perform any useful work, futex has been believed to be more energy-efficient than spinning. In this paper, using commercial chip multiprocessors (CMPs), we first provide deep insight into how a commercial CMP and the operating system together reduce power consumption during spinning- and futex-based synchronization, and we analyze the duration of synchronization cycles for each implementation. Second, we analyze the limitations of existing techniques that attempt to reduce the power consumption of CMPs during synchronization. Finally, we propose a spinning-based, energy-efficient synchronization technique dubbed SpinWise. We demonstrate that SpinWise can provide 22% higher geometric-mean energy efficiency than futex for a CMP running applications with many frequent and short synchronization events.
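The spinning-versus-blocking distinction at the heart of the comparison can be shown with a minimal sketch: a waiter that busy-polls a shared flag rather than blocking in the kernel the way a futex-based wait would. This only illustrates what "spinning" means; SpinWise's energy-saving mechanism is a hardware/OS technique not reproduced here.

```python
import threading
import time

class SpinFlag:
    """Minimal spin-based wait: the waiter burns cycles polling a shared flag
    instead of sleeping in the kernel (as a futex-based wait would).
    Illustration of spinning only, not the SpinWise technique itself."""
    def __init__(self):
        self._set = False

    def set(self):
        self._set = True

    def spin_wait(self):
        while not self._set:  # busy-wait; real spin loops insert a pause hint
            pass

flag = SpinFlag()
worker = threading.Thread(target=lambda: (time.sleep(0.01), flag.set()))
worker.start()
flag.spin_wait()  # returns once the worker sets the flag
worker.join()
```

For short waits, spinning avoids the kernel round-trip of a blocking wait, which is exactly the regime (frequent, short synchronization events) where the paper reports SpinWise winning.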
Chameleon: Versatile and Practical Near-DRAM Acceleration Architecture for Large Memory Systems.
Hadi Asghari-Moghaddam, Young Hoon Son, Jung Ho Ahn, and Nam Sung Kim.
The performance of computer systems is often limited by the bandwidth of their memory channels, but further increasing that bandwidth is challenging under the stringent pin and power constraints of packages. To increase performance further under these constraints, various near-DRAM acceleration (NDA) architectures, which tightly integrate accelerators with DRAM devices using 3D/2.5D-stacking technology, have been proposed. However, they have not yet prevailed because they often rely on expensive HBM/HMC-like DRAM devices that also suffer from limited capacity, whereas the scalability of memory capacity is critical for some computing segments such as servers. In this paper, we first demonstrate that the data buffers in a load-reduced DIMM (LRDIMM), which was originally developed to support large memory systems for servers, are ideal places to integrate near-DRAM accelerators. Second, we propose Chameleon, an NDA architecture that can be realized without 3D/2.5D-stacking technology and seamlessly integrated with large memory systems for servers. Third, we explore three microarchitectures that relax the constraints imposed by adopting the LRDIMM architecture for NDA. Our experiments demonstrate that a Chameleon-based system can offer 2.13× higher geometric-mean performance while consuming 34% lower geometric-mean data-transfer energy than a system that integrates the same accelerator logic within the processor.
Application-Transparent Near-Memory Processing Architecture with Memory Channel Network.
Mohammad Alian, Seung Won Min, Hadi Asghari-Moghaddam, Ashutosh Dhar, Dong Kai Wang, Thomas Roewer, Adam McPadden, Oliver O'Halloran, Deming Chen, Jinjun Xiong, Daehoon Kim, Wen-mei Hwu, and Nam Sung Kim.
The physical memory capacity of servers is expected to increase drastically with the deployment of forthcoming non-volatile memory technologies, a welcome improvement for emerging data-intensive applications. For such servers to be cost-effective, nonetheless, we must cost-effectively increase compute throughput and memory bandwidth commensurate with the increase in memory capacity, without compromising application readiness. Tackling this challenge, we present the Memory Channel Network (MCN) architecture in this paper. Specifically, first, we propose the MCN DIMM, an extension of a buffered DIMM in which a small but capable processor, called the MCN processor, is integrated with the buffer device on the DIMM for near-memory processing. Second, we implement device drivers that give the host and MCN processors in a server the illusion that they are independent heterogeneous nodes connected through an Ethernet link. This allows the host and MCN processors in a server to run a given data-intensive application together using popular distributed computing frameworks such as MPI and Spark, without any change to the host processor hardware or its application software, while offering the benefits of high-bandwidth, low-latency communication between the host and the MCN processors over memory channels. As such, MCN can serve as an application-transparent framework that seamlessly unifies near-memory processing within a server and distributed computing across such servers for data-intensive applications. Our simulation running the full software stack shows that a server with 8 MCN DIMMs offers 4.56× higher throughput and consumes 47.5% less energy than a cluster with 9 conventional nodes connected through Ethernet links, as it facilitates up to 8.17× higher aggregate DRAM bandwidth utilization. Lastly, we demonstrate the feasibility of MCN with an IBM POWER8 system and an experimental buffered DIMM.
ExTensor: An Accelerator for Sparse Tensor Algebra
Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W. Fletcher
Generalized tensor algebra is a prime candidate for acceleration via customized ASICs. Modern tensors feature a wide range of data sparsity, with the density of non-zero elements ranging from 10⁻⁶% to 50%. This paper proposes a novel approach to accelerate tensor kernels based on the principle of hierarchical elimination of computation in the presence of sparsity. This approach relies on rapidly finding intersections---situations where both operands of a multiplication are non-zero---enabling new data fetching mechanisms and avoiding memory latency overheads associated with sparse kernels implemented in software. We propose the ExTensor accelerator, which builds these novel ideas on handling sparsity into hardware to enable better bandwidth utilization and compute throughput. We evaluate ExTensor on several kernels relative to industry libraries (Intel MKL) and state-of-the-art tensor algebra compilers (TACO). When bandwidth normalized, we demonstrate an average speedup of 3.4×, 1.3×, 2.8×, 24.9×, and 2.7× on SpMSpM, SpMM, TTV, TTM, and SDDMM kernels respectively over a server class CPU.
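The intersection idea, multiplying only where both operands are non-zero, can be seen in a one-level software sketch: a sparse dot product that intersects the coordinate sets of two sparse vectors. ExTensor applies this elimination hierarchically in hardware; this is only the simplest software analogue, with made-up example data.

```python
def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors stored as {coordinate: value}.
    Only coordinates present in BOTH operands produce a multiplication --
    the intersection principle behind hierarchical elimination, shown at
    a single level in software."""
    # Iterate over the smaller operand and probe the larger one,
    # so work scales with the intersection rather than the full length.
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

x = {0: 2.0, 5: 3.0, 9: 1.0}  # coordinate -> non-zero value
y = {5: 4.0, 7: 1.0, 9: 2.0}
result = sparse_dot(x, y)     # only coordinates 5 and 9 intersect
```

Every coordinate outside the intersection is skipped entirely, which is the source of both the compute savings and the reduced memory traffic the paper targets.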
Contact me
Feel free to contact me if you have any questions regarding my papers, my research, job availability, or anything else.
Address
4111 Siebel Center
201 N Goodwin Ave.
Urbana, IL 61801
USA
Email
asghari2 [at] illinois.edu