A library for simplified distributed computing across heterogeneous architectures (CPU + GPU), with tensor support and polymorphic data layouts for optimal performance. Ripple enables you to scale quickly without sacrificing performance!
Designed and developed three cache-efficient schedulers for a parallel matrix multiplication program: distributed randomized work-stealing, distributed randomized work-sharing, and centralized work-sharing. Each scheduler was implemented to optimize performance metrics such as GFLOPS and L1/L2/L3 cache hits.
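
To make the work-stealing idea concrete, here is a minimal Python sketch of randomized work stealing applied to tiled matrix multiplication. It is not the original implementation, which is not shown here. The names `matmul_work_stealing`, `num_workers`, and `tile` are hypothetical, and the victim selection (trying the other workers in a shuffled order) is one simple variant of randomized stealing. Each worker pops tasks from the bottom of its own deque and, when that deque is empty, steals from the top of another worker's deque. In standard CPython the threads do not run the arithmetic in parallel, so this illustrates the scheduling logic only, not the performance.

```python
import random
import threading
from collections import deque


def matmul_work_stealing(A, B, num_workers=4, tile=2, seed=0):
    """Multiply A (n x k) by B (k x m); returns (C, total number of steals)."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]

    # One task per output tile. Tiles of C are disjoint, so C needs no lock.
    tasks = [(i, j) for i in range(0, n, tile) for j in range(0, m, tile)]

    # Each worker owns a deque guarded by its own lock (assumed design).
    queues = [deque() for _ in range(num_workers)]
    locks = [threading.Lock() for _ in range(num_workers)]
    for idx, task in enumerate(tasks):
        queues[idx % num_workers].append(task)

    steals = [0] * num_workers  # each worker writes only its own slot

    def compute_tile(i0, j0):
        # Loop order i, p, j walks rows of B and C contiguously.
        for i in range(i0, min(i0 + tile, n)):
            row_a, row_c = A[i], C[i]
            for p in range(k):
                a, row_b = row_a[p], B[p]
                for j in range(j0, min(j0 + tile, m)):
                    row_c[j] += a * row_b[j]

    def take(wid, rng):
        # Own work first: pop from the bottom (most recently queued task).
        with locks[wid]:
            if queues[wid]:
                return queues[wid].pop()
        # Otherwise try the other workers in random order, stealing from the top.
        victims = [v for v in range(num_workers) if v != wid]
        rng.shuffle(victims)
        for v in victims:
            with locks[v]:
                if queues[v]:
                    steals[wid] += 1
                    return queues[v].popleft()
        # No task is ever re-queued, so seeing every deque empty means done.
        return None

    def worker(wid):
        rng = random.Random(seed + wid)
        while True:
            task = take(wid, rng)
            if task is None:
                return
            compute_tile(*task)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C, sum(steals)
```

Calling `matmul_work_stealing(A, B, num_workers=3, tile=2)` on two nested-list matrices returns the product together with the number of steals that occurred. The steal count varies from run to run because it depends on thread timing. A work-sharing variant would instead have busy workers push tasks to others, and a centralized variant would replace the per-worker deques with a single shared queue.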