A six-year rage repository, where I explain and write easy-to-understand inference techniques, plus the code to execute them on European supercomputers like JUPITER, JUWELS and Leonardo.

**What you can find?**

1. The fastest concoction of inference tips and techniques, well documented.
2. Code to run them on Colab.
3. Code to run them on European pre-exascale and exascale supercomputers, tested and well documented.

I am planning to maintain this repository religiously. Filenames are self-explanatory, and if you see a flag like `leo` or `juwel` it means the code is supercomputer compatible. I include my Slurm scripts as well, so have fun.

**For colab code:** https://colab.research.google.com/drive/17U4lj2YLNH0GdxR9iovBnHdONB4QEh_a?usp=sharing

`KVPress:` The fastest method according to my tests on the Italian supercomputer Leonardo, on a single A100 64GB card with 16 CPUs. (Sketch below.)

`Prompt Caching:` Relatively good in my tests, but it certainly does not come out on par with KVPress. (Sketch below.)

`Graph Inference:` One of my favourite methods, which uses `torch.compile()` to get fast inference speed while `use_cache` is turned on (`use_cache` enables KV caching). While it shows fast results on Google Colab, it was significantly slower on Leonardo. (Sketch below.)

`Prompt Lookup:` Again one of my favourite methods, and it had the fastest inference speed at 1.21 seconds. (Sketch below.)

I am excited to share more methods in the coming future as I find them, including batch inference using the Ray library and continuous batching.
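**KVPress sketch:** A minimal sketch of how a KVPress run can be wired up, assuming the NVIDIA `kvpress` package and its `kv-press-text-generation` pipeline with `ExpectedAttentionPress`; the model name and compression ratio are illustrative placeholders, not my exact Leonardo settings.

```python
# Minimal KVPress sketch (assumes `pip install kvpress`).
# kvpress registers a custom pipeline that compresses the context's KV cache
# during prefill, so long contexts fit in less memory and decode faster.
import torch
from transformers import pipeline
from kvpress import ExpectedAttentionPress

pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    device="cuda",
    torch_dtype=torch.bfloat16,
)

context = "..."   # the long document whose KV cache you want to compress once
question = "..."  # a question asked against the compressed context

# Keep roughly the most useful 50% of the KV cache
press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```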
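**Prompt Caching sketch:** This follows the standard `transformers` recipe of pre-filling a `DynamicCache` with a shared prefix and reusing it across requests; the model and prompts are placeholders.

```python
# Prompt caching sketch: pre-fill the KV cache for a shared prefix once,
# then reuse it for every request that starts with that prefix.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

prefix = "You are a helpful assistant. Here is a long shared system prompt..."
prefix_inputs = tokenizer(prefix, return_tensors="pt").to(model.device)

# One forward pass fills the cache with the prefix's keys and values
prompt_cache = DynamicCache()
with torch.no_grad():
    prompt_cache = model(**prefix_inputs, past_key_values=prompt_cache).past_key_values

for question in ["What is KV caching?", "Why is prefill expensive?"]:
    full_inputs = tokenizer(prefix + question, return_tensors="pt").to(model.device)
    # Deep-copy so each request starts from the clean prefix cache
    cache = copy.deepcopy(prompt_cache)
    out = model.generate(**full_inputs, past_key_values=cache, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```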
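**Graph Inference sketch:** The idea is to compile the model's forward pass while the KV cache stays on; the sketch below uses the standard `transformers` static-cache plus `torch.compile` combination, with a placeholder model and prompt.

```python
# Graph inference sketch: static KV cache + torch.compile so generation
# replays a compiled graph instead of re-tracing Python on every step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Static cache has fixed shapes, which is what torch.compile needs
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Explain KV caching in one sentence.", return_tensors="pt").to(model.device)
# The first call is slow (compilation); later calls reuse the compiled graph
out = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```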
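**Prompt Lookup sketch:** Prompt lookup decoding is exposed in `transformers` `generate()` through the `prompt_lookup_num_tokens` argument; the sketch below shows that wiring with a placeholder model and document.

```python
# Prompt lookup decoding sketch: candidate tokens are copied from n-gram
# matches in the prompt itself, so no separate draft model is needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Works best on input-grounded tasks (summarisation, QA over a document)
document = "..."  # long context whose phrases are likely to reappear in the answer
inputs = tokenizer(document + "\nSummarise the text above.", return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=128,
    prompt_lookup_num_tokens=10,  # candidate tokens copied per n-gram match
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```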