Setting Up VLLM, SGLang, and LangChain on Ubuntu 24.04
[!NOTE] Overview: This entire article is written from memory (a busy day, summarized at night), so some descriptions may be inaccurate. It is about the overall approach; detailed steps are not the focus. Apologies in advance!!!
How to Install Ubuntu 24.04

1. Create a bootable USB drive. The author used the rufus.exe tool: download the Ubuntu 24.04 ISO image and flash it onto the USB drive with rufus.exe.
2. Insert the USB drive into the server (or PC), enter the BIOS page, and select the USB drive as the boot device.
3. Choose "Try and Install Ubuntu" (the name may be misremembered — it is the option in the first position; again, this article is written from memory, so some terms may be off. Apologies).
4. Mindlessly click Next until the partitioning page appears. (This is the critical part — pay close attention!!!)
   - Choose manual partitioning (the option containing the word "Manual").
   - First click the bottom-left area to select the boot disk. After clicking, a /boot/efi partition will appear — do not modify this partition manually.
   - Then partition the remaining space. If you are a beginner, the suggestion is: make swap the same size as your RAM. (If you have 512 GB of RAM or more, give it however much you see fit — anyone with a machine that big is not a beginner and surely knows what they are doing.) Give everything left over to /. This is the simplest layout, though definitely not the most efficient.
   - Oh, and use ext4 as the partition format if you have no special requirements.
5. Mindlessly click Next and wait for the installation to finish.
6. Choose to restart the machine. On the exit page, first remove the USB drive, then press the Enter key.
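For concreteness, a beginner-friendly layout following the advice above might look like this (illustrative only — the 1 TB disk and 64 GB RAM figures are assumptions, not from my actual machine):

```
# Example: 1 TB disk, 64 GB of RAM (hypothetical hardware)
/boot/efi    ~1 GB    EFI system partition (created by the installer; leave it alone)
swap         64 GB    same size as RAM
/            rest     ext4, everything else
```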
How to Install SGLang
Install the NVIDIA driver (important — do not get this wrong)

First, be clear about this: the officially recommended driver is version 530, but the Ubuntu 24.04 kernel is not compatible with it!!! (A lesson paid for in blood and tears — it cost me a whole morning.) Simply installing the latest version works.

Commands (remember to run these right after installing the system):

```shell
sudo apt update && sudo apt upgrade   # fetch and apply package updates
sudo ubuntu-drivers autoinstall       # automatically install a compatible driver version
```

Pros: mindless installation, good compatibility.
Cons: if your project does not support this driver — heh, then you are in trouble.
Small stroke of luck: SGLang supports it.
Note: the version that autoinstall installs changes over time; as of 2025-05-03 it installed 575.51.03.
Install CUDA (a nightmare with a pitfall at every step)

SGLang's recommended CUDA version is CUDA 12.1, but in practice CUDA 12.9 also works.

How to set up a multi-CUDA environment: download the .run packages from the official site and install them:

```shell
# CUDA 12.9
wget https://developer.download.nvidia.com/compute/cuda/12.9.0/local_installers/cuda_12.9.0_575.51.03_linux.run
sudo sh cuda_12.9.0_575.51.03_linux.run

# CUDA 12.1
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
sudo sh cuda_12.1.0_530.30.02_linux.run
```

If `sudo bash cuda_xxx_linux.run` fails, the error message tells you where the log is — inspect it with cat (e.g. `cat /path/log.conf`). A common cause is a missing or mismatched compiler; install the corresponding version and retry with --override:

```shell
sudo apt install gcc g++                 # install the matching compiler versions
sudo bash cuda_xxx_linux.run --override  # retry, overriding the compiler check
```
Installer options (important):

- Type accept.
- Then do NOT select the driver install (you already installed the driver earlier).
- Enter the advanced Options, choose the install path, and do not use the symbolic link.
Modify the exports, run `source .bashrc`, and write the auto-switching script below into your .bashrc file:

```shell
# ~/.bashrc or ~/.zshrc
# Function to switch between installed CUDA versions
switch_cuda() {
    local desired_version=$1
    # Assumes your CUDA versions are installed under /usr/local/cuda-X.Y
    local cuda_path="/usr/local/cuda-${desired_version}"

    # Check that the target CUDA directory exists
    if [ -d "${cuda_path}" ]; then
        export CUDA_HOME="${cuda_path}"
        export CUDA_PATH="${CUDA_HOME}"  # some applications read this variable too

        # Remove old CUDA paths from PATH and LD_LIBRARY_PATH to avoid conflicts
        # (note: this removal logic may need adjusting to your PATH/LD_LIBRARY_PATH structure)
        export PATH=$(echo "$PATH" | awk -v RS=':' -v ORS=':' '!/\/usr\/local\/cuda-[0-9.]+\/bin/' | sed 's/:$//')
        export LD_LIBRARY_PATH=$(echo "$LD_LIBRARY_PATH" | awk -v RS=':' -v ORS=':' '!/\/usr\/local\/cuda-[0-9.]+\/lib64/' | sed 's/:$//')

        # Add the new CUDA paths
        export PATH="${CUDA_HOME}/bin:${PATH}"
        # Only prepend a colon when LD_LIBRARY_PATH is non-empty
        if [ -z "$LD_LIBRARY_PATH" ]; then
            export LD_LIBRARY_PATH="${CUDA_HOME}/lib64"
        else
            export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"
        fi

        echo "Switched to CUDA ${desired_version}"
        echo "CUDA_HOME is now: ${CUDA_HOME}"
        echo "Verifying nvcc version:"
        nvcc --version
    else
        echo "Error: CUDA version ${desired_version} not found at ${cuda_path}"
    fi
}

# (Optional) set a default CUDA version at shell startup
# switch_cuda 12.1

# (Optional) aliases for convenient switching
alias cuda12.1="switch_cuda 12.1"
# Assumes your "12.9" install lives at /usr/local/cuda-12.9 (adjust to your setup)
alias cuda12.9="switch_cuda 12.9"
```
Use `nvcc --version` and `nvidia-smi` to check that the configuration is complete.
[!NOTE] Rant: Truly a pitfall at every step. The hardest part is that if CUDA is installed wrong, uninstalling it can easily break the driver as well. And there is compatibility to worry about — keeping the kernel, the NVIDIA driver, and CUDA mutually compatible is genuinely maddening. On top of that, there is not much material on Ubuntu 24.04 yet; the author worked most of this out by trying one combination after another. Sigh.
Install Miniconda/Anaconda

Download the installer from the official site (the Anaconda Linux version).

Setting up a conda installation shared by multiple users:

1. Run `sudo bash Anaconda3-2024.10-1-Linux-x86_64.sh`. Do not use the default location — manually change the install path to /opt/anaconda3. Note: at the final initialization prompt, choose no.
2. Fix the permissions with `sudo chmod -R u=rwX,go=rX /opt/anaconda3` so every user can read the install but only the owner can modify it.
3. Write `export PATH=/opt/anaconda3/bin:$PATH` into the ~/.bashrc file.
4. Run `conda init`.
[!IMPORTANT] Granting users conda access: every user who needs to run conda must perform steps 3 and 4.
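As a quick sanity check after the permission fix, each user can verify that the shared prefix is readable but not writable by them. A minimal plain-Python sketch — `check_shared_prefix` is a hypothetical helper of mine, not part of conda, and the /opt/anaconda3 path comes from the steps above:

```python
import os

def check_shared_prefix(prefix: str) -> dict:
    """Report access to a shared conda prefix. With the chmod above
    (u=rwX,go=rX), a non-owner user should see readable=True and
    writable=False."""
    return {
        "exists": os.path.isdir(prefix),
        "readable": os.access(prefix, os.R_OK),
        "writable": os.access(prefix, os.W_OK),
    }

print(check_shared_prefix("/opt/anaconda3"))
```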
Steps to install SGLang

Create a conda environment and install SGLang:

```shell
conda create --name SGL python=3.10
pip install sglang
```

You also need to download the AI model from the ModelScope community:
```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig

# Example 4-bit quantization config (adjust or remove if you want full precision)
quant_config = BitsAndBytesConfig(load_in_4bit=True)

# Load the model from ModelScope
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    trust_remote_code=True,
)
```

Run the model:
```shell
python -m sglang.launch_server \
    --model-path /home/bera/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --port 30000 \
    --log-level info \
    --trust-remote-code \
    --mem-fraction-static 0.7
```

Test code:
click.py:

```python
# click.py
import sglang as sgl
# Keep this import if you need to handle the chat template
from transformers import AutoTokenizer

# --- Configuration ---
# Local path of the model files (taken from the earlier ModelScope
# download log, or download it from the ModelScope community)
local_model_path = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

# Explicitly point at the SGLang backend server
# (optional if you use the default localhost:30000)
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# Define the SGLang inference function
@sgl.function
def generate_poem(s, user_prompt_text):
    # When the SGLang server has already loaded the model, the client-side
    # @sgl.function usually does NOT need `s += sgl.Model(...)`;
    # the server uses the model it loaded at startup.

    # --- Format the prompt for a chat/instruction model ---
    # This step is still very important: the input must be formatted
    # the way the model expects.
    try:
        # The tokenizer can be loaded locally just to format the prompt
        tokenizer = AutoTokenizer.from_pretrained(
            local_model_path, trust_remote_code=True
        )
        # Build a chat/instruction prompt suited to the model
        # (Qwen-style shown; verify against the actual chat template):
        # messages = [{"role": "user", "content": user_prompt_text}]
        # formatted_prompt = tokenizer.apply_chat_template(
        #     messages, tokenize=False, add_generation_prompt=True)
        # Temporary stand-in — may give poor results; replace with the
        # proper formatting above:
        formatted_prompt = f"User: {user_prompt_text}\nAssistant:"
        s += formatted_prompt  # pass the formatted prompt to the backend
    except Exception as e:
        print(f"Warning: could not format prompt or load tokenizer: {e}")
        print("Falling back to the raw prompt; results may be poor.")
        s += user_prompt_text  # fallback

    # Generate with sgl.gen
    s += sgl.gen(
        "poem_text",      # name of the result variable
        max_tokens=512,
        temperature=0.7,
    )

# User input
prompt = "你好,请用中文写一首关于夏天的诗。"

# Run the SGLang function (connects to the running server).
# Keyword arguments must match the parameter names in the
# @sgl.function definition (everything except the first, `s`).
state = generate_poem.run(user_prompt_text=prompt)

# Print the result
print(state["poem_text"])

# (Optional) clear the CUDA cache:
# import torch
# torch.cuda.empty_cache()
```

If this errors out, run:
```shell
conda update -c conda-forge libstdcxx-ng --force-reinstall
```
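The prompt-formatting fallback inside click.py can be isolated into a tiny helper for testing. This is only a sketch — `format_prompt` is a hypothetical name of mine, and real use should prefer the model's own chat template via `tokenizer.apply_chat_template`:

```python
def format_prompt(user_text: str) -> str:
    """Fallback prompt format (same shape as the stand-in in click.py).
    Prefer tokenizer.apply_chat_template() for the model's real format."""
    return f"User: {user_text}\nAssistant:"

print(format_prompt("Write a poem about summer."))
# → User: Write a poem about summer.
#   Assistant:
```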
Install VLLM
[!NOTE] We'll assume you want to use a common 7B Deepseek model available on Hugging Face, like `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct` (optimized for coding) or `deepseek-ai/DeepSeek-LLM-7B-Chat` (optimized for general chat). I'll use `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct` in the examples, but you can substitute the other name if you prefer.
Prerequisites:
NVIDIA GPU: VLLM requires an NVIDIA GPU with sufficient VRAM (A 7B parameter model generally needs at least 16GB VRAM, possibly more depending on quantization and configuration, but check VLLM docs for specifics).
CUDA Toolkit: Ensure you have a compatible NVIDIA driver and CUDA Toolkit installed. VLLM's installation usually handles CUDA dependencies via PyPI if possible, but a base driver is needed.
Python: Python 3.8 or newer.
pip: Python's package installer.
Step-by-Step Guide:
Step 1: Install VLLM
Open your terminal or command prompt and install VLLM using pip.
pip install vllm
Note: This installation might take some time as it might compile CUDA kernels specific to your GPU architecture if pre-built wheels aren't available or suitable.
Step 2: Install LangChain and OpenAI Client Library
You need LangChain core and the OpenAI client library because VLLM provides an OpenAI-compatible API endpoint. LangChain uses this client library to interact with that endpoint.
pip install langchain langchain-openai
Step 3: Start the VLLM Server
Now, launch the VLLM server, telling it to load and serve the Deepseek model. VLLM will automatically download the model from Hugging Face Hub if it's not already cached locally.
Open a new terminal window (keep this one running for the server).
Run the following command:
```shell
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
    --trust-remote-code  # Deepseek models might require this
# Optional: add --tensor-parallel-size N if you have multiple GPUs (N = number of GPUs)
# Optional: add --port 8000 if you want to specify the port (default is 8000)
```

Explanation:
- `python -m vllm.entrypoints.openai.api_server`: Runs VLLM's built-in OpenAI-compatible server.
- `--model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct`: Specifies the Hugging Face model repository ID to load. Replace this if you want `deepseek-ai/DeepSeek-LLM-7B-Chat`.
- `--trust-remote-code`: Often necessary for models that include custom code in their Hugging Face repositories.
Wait: VLLM will download the model (if needed) and then start the server. You should see output indicating the model loading process and finally something like:
`INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)` or similar, indicating the server is ready. Note the address and port (usually `localhost` or `0.0.0.0` and port `8000`). The OpenAI API endpoint will be at `http://localhost:8000/v1`.
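Before wiring up LangChain, you can sanity-check the endpoint shape. The sketch below only constructs the JSON payload that the OpenAI-compatible `/v1/chat/completions` route accepts; the actual POST is left commented out, since it assumes the server from Step 3 is running on `localhost:8000`:

```python
import json

# Request body shape for the OpenAI-compatible chat endpoint.
payload = {
    "model": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    "messages": [
        {"role": "user", "content": "Write a hello-world in Python."},
    ],
    "max_tokens": 64,
}
body = json.dumps(payload)
print(body)

# To actually send it (requires the running VLLM server):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```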
Step 4: Connect LangChain to the VLLM Server
Now, in a separate terminal or Python script/notebook, you can write LangChain code to connect to the running VLLM server.
```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

# Define the model name exactly as loaded by VLLM
# This is usually the Hugging Face repo ID unless you used --served-model-name in VLLM
MODEL_NAME = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
# Alternative if you loaded the chat model:
# MODEL_NAME = "deepseek-ai/DeepSeek-LLM-7B-Chat"

# Point LangChain to the VLLM Server
# NOTE: api_key is required by the OpenAI client, but VLLM doesn't use it.
# Use a dummy value like "dummy".
llm = ChatOpenAI(
    model=MODEL_NAME,
    openai_api_key="dummy",
    openai_api_base="http://localhost:8000/v1",  # VLLM's OpenAI compatible endpoint
    max_tokens=100,   # Optional: set max generation tokens
    temperature=0.7,  # Optional: set generation temperature
)

print("Connected to VLLM server. Sending request...")

# Prepare messages for the chat model
messages = [
    SystemMessage(content="You are a helpful coding assistant based on Deepseek Coder V2 Lite."),
    HumanMessage(content="Write a simple Python function to calculate the factorial of a number."),
]

# Send the request to the VLLM server via LangChain
try:
    response = llm.invoke(messages)
    print("\nResponse from VLLM:")
    print(response.content)
except Exception as e:
    print(f"An error occurred: {e}")
    print("Check if the VLLM server is running and the API base URL is correct.")

# --- Example 2: Simple invocation ---
print("\nSending another simple prompt:")
try:
    simple_response = llm.invoke("What is the capital of France?")
    print("\nResponse from VLLM:")
    print(simple_response.content)
except Exception as e:
    print(f"An error occurred: {e}")
```
Step 5: Run the LangChain Code
Save the Python code above into a file (e.g., `test_vllm_langchain.py`). Make sure the VLLM server from Step 3 is still running in its terminal.
Open a new terminal (or use the one where you installed the libraries), navigate to where you saved the file, and run:
python test_vllm_langchain.py
You should see the output from the script, including the response generated by the Deepseek model served via VLLM. Check the VLLM server terminal window; you'll likely see logs indicating incoming requests and processing details.
Troubleshooting & Tips:
- GPU Memory: If VLLM fails to start or crashes, you might not have enough GPU VRAM. Check `nvidia-smi` in the terminal while VLLM is loading the model. You might need a smaller model, quantization (VLLM supports some methods, check their docs), or a GPU with more memory.
- Model Name Mismatch: Ensure the `model` parameter in `ChatOpenAI` exactly matches the Hugging Face ID you told VLLM to load (e.g., `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct`).
- API Base URL: Double-check the `openai_api_base` URL. It should point to the address and port where VLLM is running, followed by `/v1`. If VLLM is running on a different machine, replace `localhost` with that machine's IP address. Ensure firewalls aren't blocking the connection.
- `--trust-remote-code`: Some models require this flag to load correctly. If VLLM fails to load the model, try adding this flag when starting the server.
- VLLM Server Logs: If the LangChain code fails to connect or gets errors, check the terminal output of the VLLM server process (from Step 3) for detailed error messages.
- Dependencies: Ensure CUDA and Python versions meet VLLM's requirements (check the latest VLLM documentation on GitHub or PyPI).
Install LangChain
Create an environment and install the packages:

```shell
conda create --name langchain python=3.10
pip install langchain langchain-openai
```

General-purpose test script:
```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage

# --- Set up the llm as before ---
sglang_api_base_url = "http://localhost:30000/v1"    # <-- change this to your backend URL
sglang_model_name = "DeepSeek-R1-Distill-Qwen-1.5B"  # <-- change this to your served model name
sglang_api_key = "dummy"  # the backend ignores it, but the client requires a value

llm = ChatOpenAI(
    model=sglang_model_name,
    openai_api_key=sglang_api_key,
    openai_api_base=sglang_api_base_url,
    temperature=0.7,
    max_tokens=512,
)

# --- Conversation Example ---
# Start with a system message and the first question
conversation_history = [
    SystemMessage(content="You are a helpful assistant discussing machine learning."),
    HumanMessage(content="What is quantization?"),
]

print("--- Sending first question ---")
first_response = llm.invoke(conversation_history)
print("First response:", first_response.content)

# Add the first response (as AIMessage) and the second question (as HumanMessage)
conversation_history.append(AIMessage(content=first_response.content))  # add AI's answer
conversation_history.append(
    HumanMessage(content="How does it help make models faster on edge devices?")
)  # add the user's second question

print("\n--- Sending second question (with context) ---")
# conversation_history now contains the system prompt, Q1, A1, and Q2:
# [SystemMessage, HumanMessage (Q1), AIMessage (A1), HumanMessage (Q2)]
# When invoking for the second response, LangChain sends this entire list.
second_response = llm.invoke(conversation_history)
print("Second response:", second_response.content)
```
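The history-management pattern above can be reduced to one small helper. This is a plain-Python sketch of how the message list grows — `extend_history` is a hypothetical name of mine, not a LangChain API, and roles are shown as plain tuples instead of message objects:

```python
def extend_history(history, ai_answer, next_question):
    """Append the assistant's last answer and the user's next question,
    mirroring how conversation_history grows in the script above."""
    history.append(("assistant", ai_answer))
    history.append(("user", next_question))
    return history

history = [
    ("system", "You are a helpful assistant discussing machine learning."),
    ("user", "What is quantization?"),
]
extend_history(
    history,
    "Quantization reduces the numeric precision of model weights...",
    "How does it help make models faster on edge devices?",
)
print([role for role, _ in history])
# → ['system', 'user', 'assistant', 'user']
```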