
[Bug]: vllm-cpu docker gguf: AttributeError: '_OpNamespace' '_C' object has no attribute 'ggml_dequantize' #8500

Closed
Quang-elec44 opened this issue Sep 16, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@Quang-elec44

Your current environment

The output of `python collect_env.py`
INFO 09-16 04:16:36 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
PyTorch version: 2.4.0+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-1015-aws-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               48
On-line CPU(s) list:                  0-47
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7R32
CPU family:                           23
Model:                                49
Thread(s) per core:                   2
Core(s) per socket:                   24
Socket(s):                            1
Stepping:                             0
BogoMIPS:                             5600.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            768 KiB (24 instances)
L1i cache:                            768 KiB (24 instances)
L2 cache:                             12 MiB (24 instances)
L3 cache:                             96 MiB (6 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-47
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow:   Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.4.0+gitfbaa4bc
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0+cpu
[pip3] torchvision==0.19.0+cpu
[pip3] transformers==4.44.2
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.1.post2@fc990f97958636ce25e4471acfd5651b096b0311
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

Model Input Dumps

No response

🐛 Describe the bug

First, I followed this instruction and built my Docker image.
Then I started my container with the docker-compose.yml file below.

services:
  llm-vllm-dev:
    image: vllm/vllm-openai:cpu
    container_name: llm-vllm-dev
    restart: unless-stopped
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
    ports:
      - "8007:8007"
    deploy:
      resources:
        limits:
          cpus: "24"
          memory: 32GB

    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./models:/models
    networks:
      - ai-assistant
    command: >
      --host 0.0.0.0
      --port 8007
      --api-key <my-api-key>
      --max-model-len 4096
      --tensor-parallel-size 1
      --served-model-name gpt-4o
      --seed 42
      --disable-log-requests
      --quantization gguf
      --model /models/Llama-3.1-Storm-8B.Q4_K_M.gguf

networks:
  ai-assistant:
    external: true

My running script:

import openai

BASE_URL="http://localhost:8007/v1" # port 8000 or 8005
API_KEY = "<my-key>"

openai_client = openai.OpenAI(
    base_url=BASE_URL,
    api_key=API_KEY
)
chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"

completion = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Hello"}
    ],
    temperature=0.0,
    n=1,
    seed=42,
    max_tokens=2048,
    extra_body={
        "chat_template": chat_template
    },
)

print(completion.choices[0].message.content)

Then I got this error:

llm-vllm-dev  | INFO 09-16 04:20:27 server.py:228] vLLM ZMQ RPC Server was interrupted.
llm-vllm-dev  | Future exception was never retrieved
llm-vllm-dev  | future: <Future finished exception=AttributeError("'_OpNamespace' '_C' object has no attribute 'ggml_dequantize'")>
llm-vllm-dev  | Traceback (most recent call last):
llm-vllm-dev  |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
llm-vllm-dev  |     async for request_output in results_generator:
llm-vllm-dev  |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 859, in generate
llm-vllm-dev  |     async for output in await self.add_request(
llm-vllm-dev  |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
llm-vllm-dev  |     raise result
llm-vllm-dev  |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 48, in _log_task_completion
llm-vllm-dev  |     return_value = task.result()
llm-vllm-dev  |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 733, in run_engine_loop
llm-vllm-dev  |     result = task.result()
llm-vllm-dev  |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 673, in engine_step
llm-vllm-dev  |     request_outputs = await self.engine.step_async(virtual_engine)
llm-vllm-dev  |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 340, in step_async
llm-vllm-dev  |     outputs = await self.model_executor.execute_model_async(
llm-vllm-dev  |   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 314, in execute_model_async
llm-vllm-dev  |     output = await make_async(self.execute_model
llm-vllm-dev  |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
llm-vllm-dev  |     result = self.fn(*self.args, **self.kwargs)
llm-vllm-dev  |   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 226, in execute_model
llm-vllm-dev  |     output = self.driver_method_invoker(self.driver_worker,
llm-vllm-dev  |   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 380, in _async_driver_method_invoker
llm-vllm-dev  |     return driver.execute_method(method, *args, **kwargs).get()
llm-vllm-dev  |   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 58, in get
llm-vllm-dev  |     raise self.result.exception
llm-vllm-dev  | AttributeError: '_OpNamespace' '_C' object has no attribute 'ggml_dequantize'

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@Quang-elec44 Quang-elec44 added the bug Something isn't working label Sep 16, 2024
@Quang-elec44 Quang-elec44 changed the title [Bug]: vllm-cpu docker: AttributeError: '_OpNamespace' '_C' object has no attribute 'ggml_dequantize' [Bug]: vllm-cpu docker gguf: AttributeError: '_OpNamespace' '_C' object has no attribute 'ggml_dequantize' Sep 16, 2024
@Isotr0py
Copy link
Collaborator

That's because vLLM doesn't support GGUF quantization on the CPU backend yet.
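
A quick way to confirm this is to check whether the GGUF kernel exists in the installed build. A minimal diagnostic sketch, assuming the compiled ops are registered under torch.ops._C once vLLM is imported (as the traceback suggests):

import torch
import vllm  # importing vLLM registers its compiled _C ops

# The GGUF dequantize kernel is not built for the CPU backend, so on the CPU
# image this prints False, matching the AttributeError in the traceback above.
print(hasattr(torch.ops._C, "ggml_dequantize"))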

@akhilreddygogula

Hi @Isotr0py, @Quang-elec44,
Do you have plans to add GGUF quantization support on CPU as well?

@Isotr0py
Collaborator

I don't have much bandwidth to port the CPU kernel right now, especially since GGUF quantization performance on GPU is still under-optimized due to the out-of-date GPU kernel. :(

Any contributions to support GGUF quantization on CPU are welcome!
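
In the meantime, one possible workaround is to serve the original full-precision checkpoint on the CPU backend instead of the GGUF file. A minimal sketch, assuming the unquantized model fits in CPU memory and assuming akjindal53244/Llama-3.1-Storm-8B is the source repo of this GGUF:

from vllm import LLM, SamplingParams

# Load the unquantized HF checkpoint on the CPU backend (no --quantization gguf),
# since the GGUF dequantize kernels are missing there.
llm = LLM(model="akjindal53244/Llama-3.1-Storm-8B", max_model_len=4096)

outputs = llm.generate(["Hello"], SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)

The same idea applies to the server command: drop --quantization gguf and point --model at the HF repo instead of the .gguf path.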
