Stars
Modeling, training, eval, and inference code for OLMo
Efficient, Flexible and Portable Structured Generation
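Structured generation works by masking the logits at each decoding step so that only grammar-legal tokens can be sampled. A hypothetical sketch of that mechanism (not this library's actual API; `allowed_ids` stands in for whatever token set the grammar state permits):

```python
import torch

def constrained_step(logits: torch.Tensor, allowed_ids: list[int]) -> int:
    # Forbid everything the grammar disallows, then pick among legal tokens.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0
    return int((logits + mask).argmax())

# Example: at this grammar state only tokens 5, 9, or 12 are legal.
logits = torch.randn(50)
print(constrained_step(logits, [5, 9, 12]))
```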
Run LLMs in the Browser with MLC / WebLLM ✨
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
Introduction to Machine Learning Systems
Run PyTorch LLMs locally on servers, desktop and mobile
Chat with AI large language models running natively in your browser. Enjoy private, server-free, seamless AI conversations.
Qwen2.5 is the large language model series developed by the Qwen team at Alibaba Cloud.
An enterprise-grade AI retriever designed to streamline AI integration into your applications with state-of-the-art retrieval accuracy.
A cross-platform ChatGPT/Gemini UI (Web / PWA / Linux / Win / macOS). One click to get your own cross-platform ChatGPT/Gemini application.
A @ClickHouse fork that supports high-performance vector search and full-text search.
asyncio is a C++20 library for writing concurrent code using the async/await syntax.
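The library mirrors the coroutine style of Python's asyncio; a minimal sketch of that pattern in Python, the model the C++ port follows (`fetch` is an illustrative name, not part of either library):

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    # Simulate an I/O-bound operation without blocking the event loop.
    await asyncio.sleep(delay)
    return f"{name} done"

async def main() -> None:
    # Run both coroutines concurrently and await all results.
    results = await asyncio.gather(fetch("a", 0.1), fetch("b", 0.2))
    print(results)

asyncio.run(main())
```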
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
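The ~4x figure follows from decode being memory-bandwidth-bound at small batch sizes: INT4 weights are a quarter the size of FP16 weights, so roughly a quarter of the bytes move per generated token. A back-of-the-envelope check in Python (the model size and bandwidth are illustrative assumptions):

```python
# Rough decode-time estimate for a memory-bandwidth-bound LLM.
# Illustrative numbers: 7B parameters, 1 TB/s of GPU memory bandwidth.
params = 7e9
bandwidth = 1e12  # bytes/s

for name, bytes_per_weight in [("FP16", 2.0), ("INT4", 0.5)]:
    # Each generated token streams every weight once from memory.
    t = params * bytes_per_weight / bandwidth
    print(f"{name}: ~{t * 1e3:.1f} ms/token")

# FP16: ~14.0 ms/token, INT4: ~3.5 ms/token -> roughly a 4x speedup.
```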
SGLang is a fast serving framework for large language models and vision language models.
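SGLang exposes an OpenAI-compatible HTTP API; a minimal sketch of querying a locally launched server, assuming the default port 30000 and a model already loaded at startup:

```python
import requests

# Assumes a server started with something like:
#   python -m sglang.launch_server --model-path <model> --port 30000
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 16,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```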
Social and customizable AI writing assistant! ✍️
User-friendly AI Interface (Supports Ollama, OpenAI API, ...)
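Interfaces like this sit in front of an Ollama or OpenAI-compatible backend; a minimal sketch of calling Ollama's generate endpoint directly, assuming Ollama is running locally with a pulled model:

```python
import requests

# Assumes `ollama serve` is running and the model was pulled,
# e.g. `ollama pull llama3.2` (model name is an assumption).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```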
An extension of TVMScript for writing simple, high-performance GPU kernels with Tensor Cores.
FlashInfer: Kernel Library for LLM Serving
Simple and efficient PyTorch-native transformer text generation in <1000 lines of Python.
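In the same spirit (plain PyTorch, no serving stack), a minimal greedy decoding loop; the Hugging Face model here is an illustrative stand-in, not gpt-fast's own Llama implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM works the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):
        logits = model(ids).logits           # [1, seq_len, vocab]
        next_id = logits[0, -1].argmax()     # greedy: most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tok.decode(ids[0]))
```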
Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
Vercel and web-llm template to run wasm models directly in the browser.
Letta (formerly MemGPT) is a framework for creating LLM services with memory.
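The core idea behind such frameworks is persisting editable memory outside the context window and injecting it into every prompt. A hypothetical sketch of that pattern (not Letta's actual API; all names here are illustrative):

```python
# Hypothetical illustration of an LLM service with memory, not Letta's real API.
class MemoryAgent:
    def __init__(self, llm):
        self.llm = llm                      # any callable: prompt -> completion
        self.core_memory = {"persona": "helpful assistant", "user_facts": ""}
        self.history: list[str] = []

    def chat(self, user_msg: str) -> str:
        # Inject persistent memory blocks ahead of the recent turns.
        prompt = (
            f"[persona] {self.core_memory['persona']}\n"
            f"[user facts] {self.core_memory['user_facts']}\n"
            + "\n".join(self.history[-10:])
            + f"\nUser: {user_msg}\nAssistant:"
        )
        reply = self.llm(prompt)
        self.history += [f"User: {user_msg}", f"Assistant: {reply}"]
        return reply

    def remember(self, fact: str) -> None:
        # The agent (or the model, via a tool call) can edit its core memory.
        self.core_memory["user_facts"] += f" {fact}"
```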