Show HN: We made glhf.chat – run almost any open-source LLM, including 405B
14 by reissbaker | 0 comments on Hacker News.
Try it out! https://glhf.chat/ (invite code: 405B)

Hey HN! We’ve been working for the past few months on a website to let you easily run (almost) any open-source LLM on autoscaling GPU clusters. It’s free for now while we figure out how to price it, but we expect to be cheaper than most GPU offerings since we can run the models multi-tenant.

Unlike Together AI, Fireworks, etc., we’ll run any model that the open-source vLLM project supports: we don’t have a hardcoded list. If you want a specific model or finetune, you don’t have to ask us for it: you can just paste the Hugging Face link in and it’ll work (as long as vLLM supports the base model architecture, we’ll run anything up to ~640GB of VRAM, give or take a little for some overhead buffer). Large models will take a few minutes to boot, but if a bunch of people are trying to use the same model, it might already be loaded and not need boot time at all.

The Llama-3-70b finetunes are especially nice, since they’re basically souped-up versions of the 8b finetunes a lot of people like to run locally but don’t have the VRAM for. We’re expecting the Llama-3.1 finetunes to be pretty great too once they start getting released.

There are some caveats for now: for example, while we support the DeepSeek V2 architecture, we can currently only run their smaller “Lite” models due to some underlying NVLink limitations (though we’re working on it). But for the most part, if vLLM supports it, we should too!

We figured Llama-3.1-405B Launch Day was a good day to launch ourselves too. Let us know in the comments if there’s anything you want us to support, or if you run into any issues. I know it’s not “local” Llama, but, well, that’s a lot of GPUs…
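For readers wondering what the "as long as vLLM supports the base model architecture" rule means in practice, here's a minimal sketch using vLLM's offline inference API directly. This is not glhf.chat's own API, just an illustration of the compatibility criterion: if a Hugging Face repo loads like this locally, the same model link should work on the site. The model name is only an example (gated repos like Llama need a Hugging Face access token).

```python
# Sketch: a model is "supported" if vLLM can load its base architecture.
from vllm import LLM, SamplingParams

# Example Hugging Face repo; any repo with a vLLM-supported architecture
# works the same way (gated models require HF access).
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Say hi to Hacker News in one sentence."], params)
print(outputs[0].outputs[0].text)
```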