I am a first year Ph.D. student in Computer Science Department at University of California, Los Angeles. I am designing and implementing systems for big data analytics and machine learning. I am a member of SOLAR group and co-advised by Professor Harry Xu and Professor Miryung Kim. I earned my B.E. in Computer Science from Tsinghua University in 2019.
Generally, I am interested in understanding the increasingly complex computer systems, with the ultimate goal of performance improvement. I am working on Computer Systems, especially parallel computing and high performance computing, including thread-level scheduling and parallel graph computing frameworks. In particular, I am focus on following topics:
- High Performance Computing
- Parallel Computing
- Computer Architecture
- Operating Systems and Compilers
I interned in CMU under the supervision of Professor Umut Acar this summer and I have been working with Professor Jidong Zhai in Tsinghua University. Here are some highlights of my research:
Efficient Scheduling with Private Deques in Multiprogrammed Environments
Although traditional private deques based work-stealing algorithm could schedule threads efficiently in monopolized environments with proved bounds on the run time, they could also suffer severe performance loss under multiprogramming as shown by our experiments. We proposed 2 new private-deques algorithms which provided much better performance under multiprogramming without performance degradation under monopolizing. The evaluations showed that our algorithms reached up to 163 times speedup in oversaturated environments and up to 9.3 times speedup in simulated multiprogrammed environments, which was state of the art to our best knowledge. My contributions:
- Implemented three high performance graph benchmarks with Parallel ML, whose performances were comparable with the performances of their state-of-the-art C++ implementations.
- Proposed and implemented 2 new scheduling strategies that could greatly improve the performance of private deques based work stealing schedulers (up to 163x speedup in over-saturated environments and up to 9.3x speedup in simulated multiprogrammed environments).
- Developed a cycler program to simulate multiprogrammed environments and evaluated our new scheduler in over-saturated environments and simulated multiprogrammed environments.
Algorithm-Directed Crash Consistence in Non-volatile Memory for HPC
Though NVMs have promised persistent data storage in memory, the volatile CPU caches can challenge data consistency in case of system crash. We introduced algorithm-based methods for three classic kinds of HPC applications which leveraged algorithm knowledge such as orthogonality relationship between data objects, invariant conditions established by the algorithm and critical intermediate results to detect data consistency with lightweight extension of application data structures or sparse cache blocks flush. Our approach had much lower overhead (at most 8.2% and less than 3% in most cases) compared with traditional checkpoint methods and log-based methods, which could impose significant overhead (at least 4.2% and at most 650%), with the same or less recomputation cost. My contributions:
- Identified key data structures of benchmark algorithms which should be updated transactionally and analyzed drawbacks of undo-logging based transaction. The results were validated by my later evaluations.
- Reimplemented three algorithms, including an iteration solver, a dense matrix multiplication and a Monte-Carlo simulation with persistent memory libraries to establish crash consistency.
- Evaluated the performance of the new programs and identified the bottleneck. These results served as baselines of our new methods.
Algorithm-Directed Crash Consistence in Non-Volatile Memory for HPC. IEEE International Conference on Cluster Computing (CLUSTER 2017)