Setting Up VLLM, SGLang, and LangChain on Ubuntu 24.04
[!NOTE] Overview: This entire article is written from memory (a busy day, summarized at night), so some descriptions may be inaccurate. It is about the overall approach; detailed steps are not the focus. Apologies in advance!!!
How to Install Ubuntu 24.04

1. Create a bootable USB drive. The author used the rufus.exe tool: download the Ubuntu 24.04 ISO image and flash it onto the USB drive with rufus.exe.
2. Insert the USB drive into the server (or PC), enter the BIOS page, and select the USB drive as the boot device.
3. Choose "Try and Install Ubuntu" (the name may be misremembered — it is the option in the first position; again, this article is written from memory, so some terms may be off. Apologies).
4. Mindlessly click Next until the partitioning page appears. (This is the critical part — pay close attention!!!)
   - Choose manual partitioning (the option containing the word "Manual").
   - First click the bottom-left area to select the boot disk. After clicking, a /boot/efi partition will appear — do not modify this partition manually.
   - Then partition the remaining space. If you are a beginner, the suggestion is: make swap the same size as your RAM. (If you have 512 GB of RAM or more, give it however much you see fit — anyone with a machine that big is not a beginner and surely knows what they are doing.) Give everything left over to /. This is the simplest layout, though definitely not the most efficient.
   - Oh, and use ext4 as the partition format if you have no special requirements.
5. Mindlessly click Next and wait for the installation to finish.
6. Choose to restart the machine. On the exit page, first remove the USB drive, then press the Enter key.
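For concreteness, a beginner-friendly layout following the advice above might look like this (illustrative only — the 1 TB disk and 64 GB RAM figures are assumptions, not from my actual machine):

```
# Example: 1 TB disk, 64 GB of RAM (hypothetical hardware)
/boot/efi    ~1 GB    EFI system partition (created by the installer; leave it alone)
swap         64 GB    same size as RAM
/            rest     ext4, everything else
```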
How to Install SGLang
Install the NVIDIA driver (important — do not get this wrong)

First, be clear about this: the officially recommended driver is version 530, but the Ubuntu 24.04 kernel is not compatible with it!!! (A lesson paid for in blood and tears — it cost me a whole morning.) Simply installing the latest version works.

Commands (remember to run these right after installing the system):

```shell
sudo apt update && sudo apt upgrade   # fetch and apply package updates
sudo ubuntu-drivers autoinstall       # automatically install a compatible driver version
```

Pros: mindless installation, good compatibility.
Cons: if your project does not support this driver — heh, then you are in trouble.
Small stroke of luck: SGLang supports it.
Note: the version that autoinstall installs changes over time; as of 2025-05-03 it installed 575.51.03.
Install CUDA (a nightmare with a pitfall at every step)

SGLang's recommended CUDA version is CUDA 12.1, but in practice CUDA 12.9 also works.

How to set up a multi-CUDA environment: download the .run packages from the official site and install them:

```shell
# CUDA 12.9
wget https://developer.download.nvidia.com/compute/cuda/12.9.0/local_installers/cuda_12.9.0_575.51.03_linux.run
sudo sh cuda_12.9.0_575.51.03_linux.run

# CUDA 12.1
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
sudo sh cuda_12.1.0_530.30.02_linux.run
```

If `sudo bash cuda_xxx_linux.run` fails, the error message tells you where the log is — inspect it with cat (e.g. `cat /path/log.conf`). A common cause is a missing or mismatched compiler; install the corresponding version and retry with --override:

```shell
sudo apt install gcc g++                 # install the matching compiler versions
sudo bash cuda_xxx_linux.run --override  # retry, overriding the compiler check
```
Installer options (important):

- Type accept.
- Then do NOT select the driver install (you already installed the driver earlier).
- Enter the advanced Options, choose the install path, and do not use the symbolic link.
Modify the exports, run `source .bashrc`, and write the auto-switching script below into your .bashrc file:

```shell
# ~/.bashrc or ~/.zshrc
# Function to switch between installed CUDA versions
switch_cuda() {
    local desired_version=$1
    # Assumes your CUDA versions are installed under /usr/local/cuda-X.Y
    local cuda_path="/usr/local/cuda-${desired_version}"

    # Check that the target CUDA directory exists
    if [ -d "${cuda_path}" ]; then
        export CUDA_HOME="${cuda_path}"
        export CUDA_PATH="${CUDA_HOME}"  # some applications read this variable too

        # Remove old CUDA paths from PATH and LD_LIBRARY_PATH to avoid conflicts
        # (note: this removal logic may need adjusting to your PATH/LD_LIBRARY_PATH structure)
        export PATH=$(echo "$PATH" | awk -v RS=':' -v ORS=':' '!/\/usr\/local\/cuda-[0-9.]+\/bin/' | sed 's/:$//')
        export LD_LIBRARY_PATH=$(echo "$LD_LIBRARY_PATH" | awk -v RS=':' -v ORS=':' '!/\/usr\/local\/cuda-[0-9.]+\/lib64/' | sed 's/:$//')

        # Add the new CUDA paths
        export PATH="${CUDA_HOME}/bin:${PATH}"
        # Only prepend a colon when LD_LIBRARY_PATH is non-empty
        if [ -z "$LD_LIBRARY_PATH" ]; then
            export LD_LIBRARY_PATH="${CUDA_HOME}/lib64"
        else
            export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"
        fi

        echo "Switched to CUDA ${desired_version}"
        echo "CUDA_HOME is now: ${CUDA_HOME}"
        echo "Verifying nvcc version:"
        nvcc --version
    else
        echo "Error: CUDA version ${desired_version} not found at ${cuda_path}"
    fi
}

# (Optional) set a default CUDA version at shell startup
# switch_cuda 12.1

# (Optional) aliases for convenient switching
alias cuda12.1="switch_cuda 12.1"
# Assumes your "12.9" install lives at /usr/local/cuda-12.9 (adjust to your setup)
alias cuda12.9="switch_cuda 12.9"
```
Use `nvcc --version` and `nvidia-smi` to check that the configuration is complete.
[!NOTE] Rant: Truly a pitfall at every step. The hardest part is that if CUDA is installed wrong, uninstalling it can easily break the driver as well. And there is compatibility to worry about — keeping the kernel, the NVIDIA driver, and CUDA mutually compatible is genuinely maddening. On top of that, there is not much material on Ubuntu 24.04 yet; the author worked most of this out by trying one combination after another. Sigh.
Install Miniconda/Anaconda

Download the installer from the official site (the Anaconda Linux version).

Setting up a conda installation shared by multiple users:

1. Run `sudo bash Anaconda3-2024.10-1-Linux-x86_64.sh`. Do not use the default location — manually change the install path to /opt/anaconda3. Note: at the final initialization prompt, choose no.
2. Fix the permissions with `sudo chmod -R u=rwX,go=rX /opt/anaconda3` so every user can read the install but only the owner can modify it.
3. Write `export PATH=/opt/anaconda3/bin:$PATH` into the ~/.bashrc file.
4. Run `conda init`.
[!IMPORTANT] Granting users conda access: every user who needs to run conda must perform steps 3 and 4.
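As a quick sanity check after the permission fix, each user can verify that the shared prefix is readable but not writable by them. A minimal plain-Python sketch — `check_shared_prefix` is a hypothetical helper of mine, not part of conda, and the /opt/anaconda3 path comes from the steps above:

```python
import os

def check_shared_prefix(prefix: str) -> dict:
    """Report access to a shared conda prefix. With the chmod above
    (u=rwX,go=rX), a non-owner user should see readable=True and
    writable=False."""
    return {
        "exists": os.path.isdir(prefix),
        "readable": os.access(prefix, os.R_OK),
        "writable": os.access(prefix, os.W_OK),
    }

print(check_shared_prefix("/opt/anaconda3"))
```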
Steps to install SGLang

Create a conda environment and install SGLang:

```shell
conda create --name SGL python=3.10
pip install sglang
```

You also need to download the AI model from the ModelScope community:
```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig

# Example 4-bit quantization config (adjust or remove if you want full precision)
quant_config = BitsAndBytesConfig(load_in_4bit=True)

# Load the model from ModelScope
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    trust_remote_code=True,
)
```

Run the model:
```shell
python -m sglang.launch_server \
    --model-path /home/bera/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --port 30000 \
    --log-level info \
    --trust-remote-code \
    --mem-fraction-static 0.7
```

Test code:
click.py:

```python
# click.py
import sglang as sgl
# Keep this import if you need to handle the chat template
from transformers import AutoTokenizer

# --- Configuration ---
# Local path of the model files (taken from the earlier ModelScope
# download log, or download it from the ModelScope community)
local_model_path = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

# Explicitly point at the SGLang backend server
# (optional if you use the default localhost:30000)
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# Define the SGLang inference function
@sgl.function
def generate_poem(s, user_prompt_text):
    # When the SGLang server has already loaded the model, the client-side
    # @sgl.function usually does NOT need `s += sgl.Model(...)`;
    # the server uses the model it loaded at startup.

    # --- Format the prompt for a chat/instruction model ---
    # This step is still very important: the input must be formatted
    # the way the model expects.
    try:
        # The tokenizer can be loaded locally just to format the prompt
        tokenizer = AutoTokenizer.from_pretrained(
            local_model_path, trust_remote_code=True
        )
        # Build a chat/instruction prompt suited to the model
        # (Qwen-style shown; verify against the actual chat template):
        # messages = [{"role": "user", "content": user_prompt_text}]
        # formatted_prompt = tokenizer.apply_chat_template(
        #     messages, tokenize=False, add_generation_prompt=True)
        # Temporary stand-in — may give poor results; replace with the
        # proper formatting above:
        formatted_prompt = f"User: {user_prompt_text}\nAssistant:"
        s += formatted_prompt  # pass the formatted prompt to the backend
    except Exception as e:
        print(f"Warning: could not format prompt or load tokenizer: {e}")
        print("Falling back to the raw prompt; results may be poor.")
        s += user_prompt_text  # fallback

    # Generate with sgl.gen
    s += sgl.gen(
        "poem_text",      # name of the result variable
        max_tokens=512,
        temperature=0.7,
    )

# User input
prompt = "你好,请用中文写一首关于夏天的诗。"

# Run the SGLang function (connects to the running server).
# Keyword arguments must match the parameter names in the
# @sgl.function definition (everything except the first, `s`).
state = generate_poem.run(user_prompt_text=prompt)

# Print the result
print(state["poem_text"])

# (Optional) clear the CUDA cache:
# import torch
# torch.cuda.empty_cache()
```

If this errors out, run:
```shell
conda update -c conda-forge libstdcxx-ng --force-reinstall
```
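The prompt-formatting fallback inside click.py can be isolated into a tiny helper for testing. This is only a sketch — `format_prompt` is a hypothetical name of mine, and real use should prefer the model's own chat template via `tokenizer.apply_chat_template`:

```python
def format_prompt(user_text: str) -> str:
    """Fallback prompt format (same shape as the stand-in in click.py).
    Prefer tokenizer.apply_chat_template() for the model's real format."""
    return f"User: {user_text}\nAssistant:"

print(format_prompt("Write a poem about summer."))
# → User: Write a poem about summer.
#   Assistant:
```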
Install VLLM
[!NOTE] We'll assume you want to use a common 7B Deepseek model available on Hugging Face, like `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct` (optimized for coding) or `deepseek-ai/DeepSeek-LLM-7B-Chat` (optimized for general chat). I'll use `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct` in the examples, but you can substitute the other name if you prefer.
Prerequisites:
NVIDIA GPU: VLLM requires an NVIDIA GPU with sufficient VRAM (A 7B parameter model generally needs at least 16GB VRAM, possibly more depending on quantization and configuration, but check VLLM docs for specifics).
CUDA Toolkit: Ensure you have a compatible NVIDIA driver and CUDA Toolkit installed. VLLM's installation usually handles CUDA dependencies via PyPI if possible, but a base driver is needed.
Python: Python 3.8 or newer.
pip: Python's package installer.
Step-by-Step Guide:
Step 1: Install VLLM
Open your terminal or command prompt and install VLLM using pip.
pip install vllm
Note: This installation might take some time as it might compile CUDA kernels specific to your GPU architecture if pre-built wheels aren't available or suitable.
Step 2: Install LangChain and OpenAI Client Library
You need LangChain core and the OpenAI client library because VLLM provides an OpenAI-compatible API endpoint. LangChain uses this client library to interact with that endpoint.
pip install langchain langchain-openai
Step 3: Start the VLLM Server
Now, launch the VLLM server, telling it to load and serve the Deepseek model. VLLM will automatically download the model from Hugging Face Hub if it's not already cached locally.
Open a new terminal window (keep this one running for the server).
Run the following command:
```shell
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
    --trust-remote-code  # Deepseek models might require this
# Optional: add --tensor-parallel-size N if you have multiple GPUs (N = number of GPUs)
# Optional: add --port 8000 if you want to specify the port (default is 8000)
```

Explanation:
- `python -m vllm.entrypoints.openai.api_server`: Runs VLLM's built-in OpenAI-compatible server.
- `--model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct`: Specifies the Hugging Face model repository ID to load. Replace this if you want `deepseek-ai/DeepSeek-LLM-7B-Chat`.
- `--trust-remote-code`: Often necessary for models that include custom code in their Hugging Face repositories.
Wait: VLLM will download the model (if needed) and then start the server. You should see output indicating the model loading process and finally something like:
`INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)` or similar, indicating the server is ready. Note the address and port (usually `localhost` or `0.0.0.0` and port `8000`). The OpenAI API endpoint will be at `http://localhost:8000/v1`.
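Before wiring up LangChain, you can sanity-check the endpoint shape. The sketch below only constructs the JSON payload that the OpenAI-compatible `/v1/chat/completions` route accepts; the actual POST is left commented out, since it assumes the server from Step 3 is running on `localhost:8000`:

```python
import json

# Request body shape for the OpenAI-compatible chat endpoint.
payload = {
    "model": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    "messages": [
        {"role": "user", "content": "Write a hello-world in Python."},
    ],
    "max_tokens": 64,
}
body = json.dumps(payload)
print(body)

# To actually send it (requires the running VLLM server):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```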
Step 4: Connect LangChain to the VLLM Server
Now, in a separate terminal or Python script/notebook, you can write LangChain code to connect to the running VLLM server.
```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

# Define the model name exactly as loaded by VLLM
# This is usually the Hugging Face repo ID unless you used --served-model-name in VLLM
MODEL_NAME = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
# Alternative if you loaded the chat model:
# MODEL_NAME = "deepseek-ai/DeepSeek-LLM-7B-Chat"

# Point LangChain to the VLLM Server
# NOTE: api_key is required by the OpenAI client, but VLLM doesn't use it.
# Use a dummy value like "dummy".
llm = ChatOpenAI(
    model=MODEL_NAME,
    openai_api_key="dummy",
    openai_api_base="http://localhost:8000/v1",  # VLLM's OpenAI compatible endpoint
    max_tokens=100,   # Optional: set max generation tokens
    temperature=0.7,  # Optional: set generation temperature
)

print("Connected to VLLM server. Sending request...")

# Prepare messages for the chat model
messages = [
    SystemMessage(content="You are a helpful coding assistant based on Deepseek Coder V2 Lite."),
    HumanMessage(content="Write a simple Python function to calculate the factorial of a number."),
]

# Send the request to the VLLM server via LangChain
try:
    response = llm.invoke(messages)
    print("\nResponse from VLLM:")
    print(response.content)
except Exception as e:
    print(f"An error occurred: {e}")
    print("Check if the VLLM server is running and the API base URL is correct.")

# --- Example 2: Simple invocation ---
print("\nSending another simple prompt:")
try:
    simple_response = llm.invoke("What is the capital of France?")
    print("\nResponse from VLLM:")
    print(simple_response.content)
except Exception as e:
    print(f"An error occurred: {e}")
```
Step 5: Run the LangChain Code
Save the Python code above into a file (e.g., `test_vllm_langchain.py`). Make sure the VLLM server from Step 3 is still running in its terminal.
Open a new terminal (or use the one where you installed the libraries), navigate to where you saved the file, and run:
python test_vllm_langchain.py
You should see the output from the script, including the response generated by the Deepseek model served via VLLM. Check the VLLM server terminal window; you'll likely see logs indicating incoming requests and processing details.
Troubleshooting & Tips:
- GPU Memory: If VLLM fails to start or crashes, you might not have enough GPU VRAM. Check `nvidia-smi` in the terminal while VLLM is loading the model. You might need a smaller model, quantization (VLLM supports some methods, check their docs), or a GPU with more memory.
- Model Name Mismatch: Ensure the `model` parameter in `ChatOpenAI` exactly matches the Hugging Face ID you told VLLM to load (e.g., `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct`).
- API Base URL: Double-check the `openai_api_base` URL. It should point to the address and port where VLLM is running, followed by `/v1`. If VLLM is running on a different machine, replace `localhost` with that machine's IP address. Ensure firewalls aren't blocking the connection.
- `--trust-remote-code`: Some models require this flag to load correctly. If VLLM fails to load the model, try adding this flag when starting the server.
- VLLM Server Logs: If the LangChain code fails to connect or gets errors, check the terminal output of the VLLM server process (from Step 3) for detailed error messages.
- Dependencies: Ensure CUDA and Python versions meet VLLM's requirements (check the latest VLLM documentation on GitHub or PyPI).
Install LangChain
Create an environment and install the packages:

```shell
conda create --name langchain python=3.10
pip install langchain langchain-openai
```

General-purpose test script:
```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage

# --- Set up the llm as before ---
sglang_api_base_url = "http://localhost:30000/v1"    # <-- change this to your backend URL
sglang_model_name = "DeepSeek-R1-Distill-Qwen-1.5B"  # <-- change this to your served model name
sglang_api_key = "dummy"  # the backend ignores it, but the client requires a value

llm = ChatOpenAI(
    model=sglang_model_name,
    openai_api_key=sglang_api_key,
    openai_api_base=sglang_api_base_url,
    temperature=0.7,
    max_tokens=512,
)

# --- Conversation Example ---
# Start with a system message and the first question
conversation_history = [
    SystemMessage(content="You are a helpful assistant discussing machine learning."),
    HumanMessage(content="What is quantization?"),
]

print("--- Sending first question ---")
first_response = llm.invoke(conversation_history)
print("First response:", first_response.content)

# Add the first response (as AIMessage) and the second question (as HumanMessage)
conversation_history.append(AIMessage(content=first_response.content))  # add AI's answer
conversation_history.append(
    HumanMessage(content="How does it help make models faster on edge devices?")
)  # add the user's second question

print("\n--- Sending second question (with context) ---")
# conversation_history now contains the system prompt, Q1, A1, and Q2:
# [SystemMessage, HumanMessage (Q1), AIMessage (A1), HumanMessage (Q2)]
# When invoking for the second response, LangChain sends this entire list.
second_response = llm.invoke(conversation_history)
print("Second response:", second_response.content)
```
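The history-management pattern above can be reduced to one small helper. This is a plain-Python sketch of how the message list grows — `extend_history` is a hypothetical name of mine, not a LangChain API, and roles are shown as plain tuples instead of message objects:

```python
def extend_history(history, ai_answer, next_question):
    """Append the assistant's last answer and the user's next question,
    mirroring how conversation_history grows in the script above."""
    history.append(("assistant", ai_answer))
    history.append(("user", next_question))
    return history

history = [
    ("system", "You are a helpful assistant discussing machine learning."),
    ("user", "What is quantization?"),
]
extend_history(
    history,
    "Quantization reduces the numeric precision of model weights...",
    "How does it help make models faster on edge devices?",
)
print([role for role, _ in history])
# → ['system', 'user', 'assistant', 'user']
```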