Using mlx-lm to Run a Local LLM

Introduction

mlx-lm is a library from Apple's MLX project for running Large Language Models directly on Apple Silicon chips. Compared to Ollama, mlx-lm can have a performance edge because MLX works directly with the chip's Unified Memory and the Apple GPU, which can translate into faster generation and better energy efficiency for Mac users.

Prerequisites

Because mlx-lm is built specifically for Apple Silicon, the following instructions apply only if you are using a Mac with an Apple Silicon chip.


Details

First, install mlx-lm:

pip install mlx-lm


Then, visit this HuggingFace page of the mlx-community. This is a reputable community that shares LLMs converted to the MLX format so they run well on Macs with Apple Silicon. You can search for a model that fits your usage needs and your machine's configuration. Here, I will use the model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit.

The model name mlx-community/Qwen2.5-Coder-7B-Instruct-4bit consists of the following components:

  • mlx-community is the organization/repo that provides the MLX build
  • Qwen2.5-Coder is Qwen's generation-2.5 model series specialized for programming
  • 7B is the model size: 7 billion parameters
  • Instruct indicates this is a version fine-tuned to understand and follow instructions
  • 4bit means the model has been compressed (quantized) to 4-bit weights to reduce RAM usage while still maintaining good accuracy.
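A quick back-of-envelope check of what 4-bit quantization buys (illustrative arithmetic only; real checkpoints also store embedding tables, quantization scales, and biases, which is why the actual download below is about 4.3 GB rather than 3.5 GB):

```python
def approx_weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Rough size of just the weights, ignoring quantization overhead."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(approx_weights_gb(7, 16))  # 14.0 -> ~14 GB at 16-bit precision
print(approx_weights_gb(7, 4))   # 3.5  -> ~3.5 GB at 4-bit
```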


Next, run the model directly in the terminal to open the chat interface:

$ python -m mlx_lm.chat --model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit
Fetching 9 files: 100%|███████████████████████████| 9/9 [00:49<00:00, 5.45s/it]
Download complete: 4.30GB [00:49, 87.6MB/s]
[INFO] Starting chat session with mlx-community/Qwen2.5-Coder-7B-Instruct-4bit.
The command list:
- 'q' to exit
- 'r' to reset the chat
- 'h' to display these commands
>> hello
Hello! How can I assist you today?<|im_end|>
>>


This command sends a prompt straight to the LLM for one-off execution:

$ python -m mlx_lm.generate --model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit --prompt "Give me the Fibonacci algorithm in Python"
Fetching 9 files: 100%|███████████████████████| 9/9 [00:00<00:00, 127100.12it/s]
Download complete: 0.00B [00:00, ?B/s]
==========
Certainly! The Fibonacci sequence is a series of numbers where each number is the sum of the two preceding ones, usually starting with 0 and 1. Here is a simple implementation of the Fibonacci algorithm in Python:

### Iterative Approach
```python
def fibonacci_iterative(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        a, b = 0, 1
        for _ in range
==========
Prompt: 36 tokens, 54.020 tokens-per-sec
Generation: 100 tokens, 24.902 tokens-per-sec
Peak memory: 4.367 GB
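Notice the answer above was cut off mid-function: mlx_lm.generate stops after 100 new tokens by default. If your version of mlx-lm supports it (check python -m mlx_lm.generate --help), you can raise the limit with --max-tokens:

```shell
# Raise the token budget so the answer is not truncated
python -m mlx_lm.generate \
  --model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit \
  --prompt "Give me the Fibonacci algorithm in Python" \
  --max-tokens 500
```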


To use the LLM from Python code, create a main.py file with the following content:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-Coder-7B-Instruct-4bit")
messages = [{"role": "user", "content": "Give me the Fibonacci algorithm in Python"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

print("--- AI handling ---")
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    max_tokens=500,
)

print("\n\n--- Result ---")
print(response)

The above code performs a 3-step process:

  • First, load the model and tokenizer into memory
  • Then, use the Chat Template to format the user's question according to the exact structure required by the model
  • Finally, call the generate function for the AI to process and create a response based on the prepared prompt.
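The chat template in step two is what produced the <|im_end|> marker we saw in the chat session: Qwen models use the ChatML format. A hand-rolled sketch of what the formatted prompt roughly looks like (the real template comes from the tokenizer config and may also add a system turn):

```python
def chatml_prompt(user_content: str) -> str:
    # Approximation of Qwen's ChatML layout; tokenizer.apply_chat_template
    # builds the authoritative version from the model's chat_template config.
    return (
        f"<|im_start|>user\n{user_content}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(chatml_prompt("Give me the Fibonacci algorithm in Python"))
```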


Running it shows that the LLM produces a fairly detailed answer:

$ poetry run python main.py
Fetching 9 files: 100%|███████████████████████████████████████████| 9/9 [00:00<00:00, 71902.35it/s]
Download complete: 0.00B [00:00, ?B/s]
--- AI handling ---
...
--- Result ---
Certainly! The Fibonacci sequence is a series of numbers where each number is the sum of the two preceding ones, usually starting with 0 and 1. Here is a simple implementation of the Fibonacci algorithm in Python:

### Iterative Approach
```python
def fibonacci_iterative(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        a, b = 0, 1
        for _ in range(2, n + 1):
            a, b = b, a + b
        return b

# Example usage:
n = 10
print(f"Fibonacci number at position {n} is {fibonacci_iterative(n)}")
```

### Recursive Approach
```python
def fibonacci_recursive(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2)

# Example usage:
n = 10
print(f"Fibonacci number at position {n} is {fibonacci_recursive(n)}")
```

### Memoization Approach (Optimized Recursive)
```python
def fibonacci_memoization(n, memo={}):
    if n in memo:
        return memo[n]
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        memo[n] = fibonacci_memoization(n - 1, memo) + fibonacci_memoization(n - 2, memo)
        return memo[n]

# Example usage:
n = 10
print(f"Fibonacci number at position {n} is {fibonacci_memoization(n)}")
```

### Generator Approach
```python
def fibonacci_generator():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Example usage:
n = 10
fib_gen = fibonacci_generator()
for _ in range(n):
    next(fib_gen)
print(f"Fibonacci number at position {n} is {next(fib_gen)}")
```

To monitor CPU, GPU, and RAM usage while the model runs, you can install and use asitop:

brew install asitop
sudo asitop


asitop's dashboard shows the utilization levels of CPU, GPU, and RAM, and even power consumption, in real time.

Happy coding!

