A repository born of six years of rage, where I explain and write easy-to-understand inference techniques and the code to execute them on European supercomputers like JUPITER, JUWELS and Leonardo.

What can you find?

  1. The fastest concoction of inference tips and techniques, all well documented
  2. Code to run them on Google Colab
  3. Code to run them on European pre-exascale and exascale supercomputers

These are tested and well documented, and I plan to maintain this repository religiously. Filenames are self-explanatory; if you see a flag like leo or juwel, it means the code is supercomputer-compatible. I include my Slurm scripts as well, so have fun.

For the Colab code: https://colab.research.google.com/drive/17U4lj2YLNH0GdxR9iovBnHdONB4QEh_a?usp=sharing

KVPress: the fastest method according to my tests on the Italian supercomputer Leonardo, measured on a single A100 64 GB card with 16 CPUs.
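
Below is a minimal sketch of running KVPress through its Hugging Face pipeline integration, assuming the `kvpress` package is installed; the model name and compression ratio are placeholders, not the exact settings used in this repository.

```python
# Minimal KVPress sketch: compress the KV cache of a long context before answering.
from transformers import pipeline
from kvpress import ExpectedAttentionPress

pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model
    device="cuda:0",
    torch_dtype="auto",
)

context = "..."  # a long document you want to compress once
question = "What is the main finding?"

press = ExpectedAttentionPress(compression_ratio=0.5)  # drop ~50% of the KV cache
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```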

Prompt Caching: relatively good in my tests, but it does not come close to par with KVPress.
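
Here is a small sketch of prompt caching with Transformers: the shared prefix is pre-filled into a `DynamicCache` once and reused for later requests. Model name and prompt text are placeholders.

```python
# Prompt caching sketch: pre-fill the KV cache for a shared prefix, then reuse it.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

# Pre-fill the cache once with the shared prefix (e.g. a long system prompt).
prefix = "You are a helpful assistant. <long shared instructions> "
prefix_inputs = tokenizer(prefix, return_tensors="pt").to(model.device)
prompt_cache = DynamicCache()
with torch.no_grad():
    prompt_cache = model(**prefix_inputs, past_key_values=prompt_cache).past_key_values

# Reuse the cached prefix for each new request; only the suffix is recomputed.
full_inputs = tokenizer(prefix + "What is KV caching?", return_tensors="pt").to(model.device)
outputs = model.generate(**full_inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```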

Graph Inference: one of my favourite methods; it uses torch.compile() to get fast inference while use_cache is turned on (use_cache enables KV caching). While it shows fast results on Google Colab, it was significantly slower on Leonardo.
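
For reference, this is a sketch of the standard Transformers recipe for torch.compile() with a static KV cache; the model name is a placeholder, and the first call is slow because compilation happens then.

```python
# Graph-style inference sketch: static KV cache + torch.compile for faster decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

# A static cache keeps tensor shapes fixed so torch.compile can reuse captured graphs.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Explain KV caching in one sentence.", return_tensors="pt").to(model.device)
# First call compiles (slow); subsequent calls reuse the compiled graph.
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```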

Prompt Lookup: again one of my favourite methods; it recorded the fastest inference time in my runs, at 1.21 seconds.
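
Prompt lookup decoding is exposed in Transformers through the `prompt_lookup_num_tokens` argument of `generate()`; the sketch below shows the idea, with the model name and input text as placeholders.

```python
# Prompt lookup decoding sketch: draft tokens are copied from n-gram matches in the
# prompt itself and then verified in a single forward pass (great for long contexts).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

long_context = "<paste a long document here> Summarise the document above."
inputs = tokenizer(long_context, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64, prompt_lookup_num_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```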

I am excited to share more methods as I find them, including batch inference with the Ray library and continuous batching.