How to Build a Blazing Fast AI Discord Bot with Python and Local LLMs (No API Keys Required)
No API keys. No token costs. A fully asynchronous pipeline powered by Ollama and discord.py that keeps your bot lightning-fast and completely lag-free.
If you have spent any time playing around with local Large Language Models, you already know how incredible it is to run AI on your own hardware with zero token costs and total privacy. But running an AI assistant on your desktop is just the beginning.
Why keep it to yourself when you can drop it directly into your Discord server to hang out, moderate, or brainstorm with your entire community? In this guide, we are building a fully functional, intelligent Discord bot using discord.py and Ollama.
The catch with most AI Discord bot tutorials? They use synchronous code that freezes the bot while the AI generates a response. If your bot stops responding to the Discord gateway for too long, it gets disconnected. We are going to fix that by building a fully asynchronous pipeline that keeps your bot fast, responsive, and completely lag-free.
Why Go Local
Why Local LLMs for Discord?
No monthly API fees, no matter how chaotic or active your server chat gets. Zero token costs, ever.
Your community's chat data never leaves your machine or hosting environment. Complete privacy.
Pass custom system prompts to change how your bot behaves on the fly — no redeployments needed.
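One way to make system prompts switchable at runtime is a small per-server lookup. The sketch below is only an illustration: the names `set_system_prompt` and `get_system_prompt` and the in-memory dictionary are assumptions, not part of discord.py or Ollama, and the default prompt is the one used in bot.py later in this guide.

```python
from typing import Dict, Optional

DEFAULT_SYSTEM_PROMPT = (
    "You are an elite, helpful tech companion inside a Discord server. "
    "Keep answers concise and readable."
)

# Hypothetical in-memory store: guild (server) ID -> custom system prompt.
# It is lost on restart; persist it somewhere if that matters to you.
_system_prompts: Dict[int, str] = {}


def set_system_prompt(guild_id: int, prompt: str) -> None:
    """Store a custom prompt for one server; blank input resets to the default."""
    prompt = prompt.strip()
    if prompt:
        _system_prompts[guild_id] = prompt
    else:
        _system_prompts.pop(guild_id, None)


def get_system_prompt(guild_id: Optional[int]) -> str:
    """Return the server's custom prompt, or the default (also used for DMs)."""
    if guild_id is None:
        return DEFAULT_SYSTEM_PROMPT
    return _system_prompts.get(guild_id, DEFAULT_SYSTEM_PROMPT)
```

In the bot, you could pass `get_system_prompt(message.guild.id if message.guild else None)` as the `system` argument of the generate call, and let a command of your own choosing call `set_system_prompt`.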
Architecture
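How Data Flows
A user mentions the bot in a channel. discord.py delivers that message to the on_message handler, which strips the mention and awaits a generation call on Ollama's AsyncClient. The generated text goes back to the channel as a reply, split into chunks if it exceeds Discord's 2000-character limit.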
Getting Started
The Step-by-Step Setup
Before writing the bot script, we need to handle the infrastructure. Follow these steps precisely to get your tokens and local environment ready.
Install Ollama and Pull Your Model
Download and install Ollama for your OS. Once it's running in your terminal, pull a highly efficient model like Llama 3 or Phi-3:
ollama pull llama3
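If you want to confirm from Python that the pull worked, you can ask the local Ollama server which models it has. This is a minimal sketch using only the standard library. It assumes Ollama is listening on its default local port (11434) and that the `/api/tags` endpoint returns a `models` list with a `name` field per model; the function names are our own.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_is_listed(tags_payload: dict, model_name: str) -> bool:
    """Return True if model_name appears in an /api/tags payload.

    A bare name such as "llama3" is treated as "llama3:latest".
    """
    wanted = model_name if ":" in model_name else f"{model_name}:latest"
    names = {m.get("name", "") for m in tags_payload.get("models", [])}
    return wanted in names


def fetch_local_models(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> dict:
    """Ask the local Ollama server which models it has downloaded."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With Ollama running, `model_is_listed(fetch_local_models(), "llama3")` should tell you whether the model is ready before you start the bot.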
Set Up Your Discord Developer Application
Go to the Discord Developer Portal, click New Application, and give it a name. Navigate to the Bot tab, click Reset Token, and save that token securely. The bot.py script reads it from a DISCORD_TOKEN entry in a .env file, so keep that file out of version control.
Enable Gateway Intents
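Scroll down on the Bot tab to Privileged Gateway Intents. Toggle Message Content Intent to ON so your bot can read the text of the messages it needs to process. The bot.py script requests this intent, and discord.py will refuse to connect if an intent is requested in code but not enabled here.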
Install Python Dependencies
Install the required libraries: discord.py for the Discord API connection, the official ollama Python package (which includes an async client), and python-dotenv for loading the token from a .env file:
pip install discord.py ollama python-dotenv
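The bot.py script reads the token from the `DISCORD_TOKEN` environment variable. If that variable is missing, the failure tends to show up later as a less obvious login error, so it can help to fail fast. A small sketch, where the helper name `require_token` is our own:

```python
import os


def require_token(var_name: str = "DISCORD_TOKEN") -> str:
    """Return the bot token from the environment, or raise a clear error."""
    token = os.getenv(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set. Add a line like {var_name}=your-token-here "
            "to a .env file next to bot.py, or export it in your shell."
        )
    return token
```

In bot.py you would call `TOKEN = require_token()` right after `load_dotenv()`, in place of the plain `os.getenv` lookup.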
The Blueprint
bot.py — Full Production Code
Create a file named bot.py and paste the following code. It connects to Discord, ignores messages from other bots, responds when the bot is mentioned, and calls the AI engine without blocking the event loop.
import os

import discord
import ollama
from discord.ext import commands
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()
TOKEN = os.getenv('DISCORD_TOKEN')
MODEL_NAME = 'llama3'  # swap for phi3 or mistral

# Configure the bot
intents = discord.Intents.default()
intents.message_content = True
bot = commands.Bot(command_prefix="!", intents=intents)


@bot.event
async def on_ready():
    print(f'🤖 {bot.user.name} is online and connected to Discord!')
    await bot.change_presence(activity=discord.Game(name="with Local Brains"))


@bot.event
async def on_message(message):
    # Rule 1: Never reply to yourself or other bots
    if message.author.bot:
        return

    # Respond only when the bot itself is mentioned (not @everyone or @here)
    if bot.user in message.mentions:
        # Strip both mention forms (<@id> and <@!id>) out of the prompt text
        clean_prompt = (
            message.content
            .replace(f'<@{bot.user.id}>', '')
            .replace(f'<@!{bot.user.id}>', '')
            .strip()
        )

        if not clean_prompt:
            await message.channel.send(
                f"Hey {message.author.mention}! Tag me and ask something like: "
                "'How do I optimize a Python script?'"
            )
            return

        # Show a typing indicator so users know the local engine is working
        async with message.channel.typing():
            try:
                # Use Ollama's native AsyncClient to avoid blocking the event loop
                async_client = ollama.AsyncClient()
                response = await async_client.generate(
                    model=MODEL_NAME,
                    prompt=clean_prompt,
                    system=(
                        "You are an elite, helpful tech companion inside a "
                        "Discord server. Keep answers concise and readable."
                    ),
                )
                ai_reply = response.get(
                    'response', 'Something went wrong with my generation loop.'
                )

                # Stay under Discord's 2000-character message limit
                if len(ai_reply) > 2000:
                    chunks = [ai_reply[i:i + 1900] for i in range(0, len(ai_reply), 1900)]
                    for chunk in chunks:
                        await message.reply(chunk)
                else:
                    await message.reply(ai_reply)
            except Exception as e:
                print(f"Error during AI processing: {e}")
                await message.reply(
                    "Sorry, my local cognitive cores ran into a processing glitch."
                )

    # Let any "!" prefix commands keep working
    await bot.process_commands(message)


if __name__ == '__main__':
    bot.run(TOKEN)
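The bot.py code cuts long replies at a fixed 1900-character stride, which can split a word or sentence in half. Here is a sketch of a gentler splitter that prefers to break at a newline or a space. The function name `split_message` is hypothetical, and this version does not try to keep Markdown code fences balanced across chunks.

```python
from typing import List

DISCORD_LIMIT = 2000  # Discord's per-message character limit


def split_message(text: str, limit: int = DISCORD_LIMIT) -> List[str]:
    """Split text into non-empty pieces of at most `limit` characters.

    Prefers to break at the last newline inside the window, then at the
    last space, and falls back to a hard cut when neither exists.
    """
    chunks: List[str] = []
    remaining = text.strip()
    while len(remaining) > limit:
        window = remaining[:limit]
        cut = window.rfind("\n")
        if cut <= 0:
            cut = window.rfind(" ")
        if cut <= 0:
            cut = limit  # no natural break point: hard cut
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks
```

To use it, replace the length check in bot.py with a loop such as `for chunk in split_message(ai_reply): await message.reply(chunk)`.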
Breaking Down the Magic
The secret sauce here is ollama.AsyncClient(). Here's why it matters:
If you use requests or a synchronous client call, your entire script stops dead in its tracks while your GPU or CPU crunches the math. During that time discord.py cannot send heartbeats to the gateway, so to Discord your bot looks dead, and a long enough stall ends in a dropped connection.
By awaiting the async generation call, we hand control back to the discord.py event loop. While the AI works out a complex response for one channel, the bot can concurrently:
Handle other incoming messages and commands in different channels without any delay.
Keep processing other gateway events, such as reactions and channel updates, across the entire server.
Keep the gateway heartbeat going, so the connection is not dropped while a response is being generated.
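You can see the difference without Discord or Ollama at all. The sketch below is a simulation, not the bot's real internals: `heartbeat` stands in for discord.py's background work, and the two "generation" functions fake a half-second model call, once with a blocking `time.sleep` and once with an awaited `asyncio.sleep`.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Tuple


async def heartbeat(ticks: List[float], interval: float = 0.05) -> None:
    """Stand-in for background work: record a tick every `interval` seconds."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_generation() -> None:
    time.sleep(0.5)  # simulates a synchronous model call: the loop is frozen


async def async_generation() -> None:
    await asyncio.sleep(0.5)  # simulates an awaited call: the loop keeps running


async def count_ticks(generation: Callable[[], Awaitable[None]]) -> int:
    """Run one simulated generation and count heartbeat ticks during it."""
    ticks: List[float] = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    await generation()
    task.cancel()
    return len(ticks)


def compare() -> Tuple[int, int]:
    """Return (ticks during blocking call, ticks during awaited call)."""
    blocked = asyncio.run(count_ticks(blocking_generation))
    free = asyncio.run(count_ticks(async_generation))
    return blocked, free
```

Calling `compare()` should show a single tick for the blocking version, because the heartbeat never gets another turn, and around ten for the awaited version. That starved heartbeat is what a frozen bot looks like from Discord's side.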
What's Next?
You can take this project a step further by adding a SQLite or Firebase database that logs messages by channel or thread ID, giving the bot long-term, multi-user memory directly in your server channels. Bookmark BitScript for weekly deep-dives into edge computing, clean code structures, and private AI tools.
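As a starting point for the SQLite route, here is a minimal sketch of per-channel memory using Python's built-in sqlite3 module. The table layout and the function names (`open_memory`, `remember`, `recall`) are assumptions made for illustration. Adapt them to whatever you actually want to store.

```python
import sqlite3
from typing import Dict, List


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the history database; ':memory:' keeps it in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               channel_id INTEGER NOT NULL,
               role TEXT NOT NULL,
               content TEXT NOT NULL
           )"""
    )
    return conn


def remember(conn: sqlite3.Connection, channel_id: int, role: str, content: str) -> None:
    """Append one message to a channel's history."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO messages (channel_id, role, content) VALUES (?, ?, ?)",
            (channel_id, role, content),
        )


def recall(conn: sqlite3.Connection, channel_id: int, limit: int = 10) -> List[Dict[str, str]]:
    """Return the last `limit` messages for a channel, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE channel_id = ? "
        "ORDER BY id DESC LIMIT ?",
        (channel_id, limit),
    ).fetchall()
    return [{"role": role, "content": content} for role, content in reversed(rows)]
```

The list returned by `recall` uses the role/content shape that chat-style endpoints expect, so it could be passed as the `messages` argument to the Ollama client's `chat` method instead of sending a single prompt. The sqlite3 calls here are synchronous. They are usually quick for a small local file, but if they ever become slow, moving them off the event loop (for example with `asyncio.to_thread`) keeps the bot responsive.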
Let's Connect
Ran into setup roadblocks or want to share your custom bot configurations? Let's talk about it in the community.