How to set up Docker Model Runner on Fedora

In this article, we will set up Docker Model Runner to run and manage AI models locally and leverage them in AI applications.

Test Environment

  • Fedora 41 server
  • Docker v28.5.2

Docker Model Runner

Docker Model Runner (DMR) lets you run and manage AI models locally using Docker. Models are pulled from Docker Hub, an OCI-compliant registry, or Hugging Face the first time you use them and are stored locally. They load into memory only at runtime when a request is made, and unload when not in use to optimize resources. Because models can be large, the initial pull may take some time. After that, they’re cached locally for faster access.

If you prefer video, here is a YouTube walkthrough of the same step-by-step procedure outlined below.

https://youtu.be/uQ3wWUWKrMM

Procedure

Step 1: Ensure Docker is installed and running

As a first step, ensure that Docker is installed and running on your machine. Follow “Install Docker Engine on Fedora” for the installation steps.

admin@linuxser:~$ docker --version
Docker version 28.5.2, build ecc6942

admin@linuxser:~$ sudo systemctl start docker.service 
admin@linuxser:~$ sudo systemctl status docker.service 
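
To make this persistent across reboots, the service can also be enabled at boot:

admin@linuxser:~$ sudo systemctl enable --now docker.service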

Step 2: Install Docker Model Runner plugin

Here we need to install the “docker-model-plugin” package to run and manage AI models locally.

admin@linuxser:~$ sudo dnf install docker-model-plugin
admin@linuxser:~$ docker model version
Client:
 Version:    v1.1.8
 OS/Arch:    linux/amd64

Server:
 Version:    (not reachable)
 Engine:     Docker Engine
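
Note that the Server section reports “(not reachable)” at this point because the model runner container has not been started yet; as we will see in Step 3, it is created automatically on the first “docker model pull”.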

Step 3: Pull a GGUF-compatible model image

Docker Model Runner (DMR) allows you to run a wide variety of Large Language Models (LLMs) and generative AI models locally. It functions by pulling models as OCI artifacts and serving them through built-in inference engines like llama.cpp, vLLM, and Diffusers.

Here is the list of supported model formats.

  • GGUF: The primary format for local CPU and GPU inference via the llama.cpp engine.
  • Safetensors: Used for high-throughput production inference via the vLLM engine and for image generation.
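
Besides Hugging Face, models can also be pulled straight from Docker Hub’s “ai/” namespace. A quick sketch (the “ai/smollm2” model name is used here for illustration; substitute any model from that namespace):

admin@linuxser:~$ docker model pull ai/smollm2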

Here we are going to pull the “bartowski/Llama-3.2-1B-Instruct-GGUF” model, which is a repository of quantized versions of Meta’s Llama 3.2 1B-parameter instruction-tuned model. These GGUF files are designed to let the lightweight model run on local machines.

admin@linuxser:~$ docker model pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
latest: Pulling from docker/model-runner
5f528443f346: Pull complete 
9d7db96ef8a1: Pull complete 
a06403e5c64b: Pull complete 
e7f277a0e57c: Pull complete 
851aa95ecd2d: Pull complete 
4f4fb700ef54: Pull complete 
8039c435bbd8: Pull complete 
bd6e1b796515: Pull complete 
2eef105b568f: Pull complete 
3b4ae614ede8: Pull complete 
5e75d2484e8d: Pull complete 
Digest: sha256:d3d33e63dff5ca93426ff7607b8f174551bb5377e2a6ba82247bbb3f540efa5a
Status: Downloaded newer image for docker/model-runner:latest
Successfully pulled docker/model-runner:latest
Creating model storage volume docker-model-runner-models...
Starting model runner container docker-model-runner...
f2900d93efae: Pull complete [==================================================>]  807.7MB/807.7MB
b33563055168: Pull complete [==================================================>]  24.34kB/24.34kB
6f85a640a97c: Pull complete [==================================================>]  807.7MB/807.7MB
Model pulled successfully

As you can see, a Docker volume is created to store the models and a container is instantiated for Docker Model Runner.

admin@linuxser:~$ docker volume ls
DRIVER    VOLUME NAME
local     docker-model-runner-models
admin@linuxser:~$ docker ps
CONTAINER ID   IMAGE                        COMMAND               CREATED         STATUS                     PORTS                                                     NAMES
085cb95b947c   docker/model-runner:latest   "/app/model-runner"   4 minutes ago   Up 4 minutes (unhealthy)   127.0.0.1:12434->12434/tcp, 172.17.0.1:12434->12434/tcp   docker-model-runner
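
You can also confirm the runner is up with the “docker model status” subcommand, which reports whether Docker Model Runner is running:

admin@linuxser:~$ docker model status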

Step 4: Run the GGUF-compatible model image

Now we will run the downloaded model and pass it the prompt “What is the latest version of python and its top 5 new features”.

Below is the response from the model, based on the data it was last trained on.

admin@linuxser:~$ docker model run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF "What is the latest version of python and its top 5 new features"
As of my cut-off knowledge date (December 2023), the latest version of Python is Python 3.10. The Python Software Foundation (PSF) has released Python 3.10 with the following key features:

**Python 3.10:**

1.  **Improved Performance**: Python 3.10 boasts significant performance improvements, particularly when it comes to large datasets.
2.  **New `asyncio` Module**: The `asyncio` module, introduced in Python 3.8, is now a built-in module. This allows for asynchronous programming in Python.
3.  **`concurrent.futures` Module**: This module provides a high-level interface for asynchronously executing callables.
4.  **`types` Module**: The `types` module provides a more comprehensive set of data types than before.
5.  **`asyncio` and `concurrent.futures` Integration**: The `asyncio` and `concurrent.futures` modules are now tightly integrated, making it easier to write asynchronous code.

**Top 5 New Features:**

1.  **Improved Support for C++ and C++17**: Python 3.10 includes improved support for C++ and C++17 features.
2.  **`decimal` Module**: The `decimal` module allows you to represent decimal numbers as fractions, which can be useful for financial calculations.
3.  **`pybind11` Library**: The `pybind11` library allows you to create C++ bindings from Python, making it easier to use C++ functions from Python.
4.  **Improved Support for WebAssembly**: Python 3.10 includes improved support for WebAssembly, allowing you to write WebAssembly applications in Python.
5.  **`trio` Library**: The `trio` library provides a high-level interface for creating, running, and managing Rust programs from Python.

Please note that the information provided is based on my cut-off knowledge date (December 2023) and may not reflect any changes or updates made after that date.

Step 5: Load model

We can also run the model in detached mode. The model is loaded into memory and remains in a running state; if it receives no requests for 5 minutes, it is unloaded automatically.

admin@linuxser:~$ docker model run --detach hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF

admin@linuxser:~$ docker model ls
MODEL NAME                                           PARAMETERS  QUANTIZATION   ARCHITECTURE  MODEL ID      CREATED        CONTEXT  SIZE       
huggingface.co/bartowski/llama-3.2-1b-instruct-gguf  1.24B       MOSTLY_Q4_K_M  llama         43a02806ac7a  19 months ago  131072   762.81MiB
admin@linuxser:~$ docker model ps
MODEL NAME                                  BACKEND    MODE        UNTIL               
hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF  llama.cpp  completion  4 minutes from now

Step 6: Unload model

We can also manually unload the model using the “docker model unload” command, as shown below.

admin@linuxser:~$ docker model unload hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
Unloaded 1 model(s).
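
If a model is no longer needed at all, “docker model rm” removes it from local storage entirely (unlike unload, which only frees memory):

admin@linuxser:~$ docker model rm hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF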

Step 7: Use the model REST API

Here we are going to use the REST API listening on port “12434”, which can be used to communicate with the model and send it messages.

admin@linuxser:~$ curl http://localhost:12434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF",
    "messages": [
      {"role": "user", "content": "What is Huggingface in 10 words"}
    ],
    "stream": false
  }'
{"model":"hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF","created_at":"2026-05-07T11:42:07.597694206Z","message":{"role":"assistant","content":"Huggingface is an open-source AI platform for natural language processing research and development."},"done":true}

We can also create a bidirectional pipe as a hack that accepts requests on port “12435” from a remote server, forwards them to the LLM host on port 12434, and sends the response back.

admin@linuxser:~$ mkfifo pipe; nc -l -p 12435 < pipe | nc 127.0.0.1 12434 > pipe

Now we can access the REST API remotely as shown below.

admin@fedser:~$ curl http://linuxser.stack.com:12435/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF",
    "messages": [
      {"role": "user", "content": "What is Huggingface in 10 words"}
    ],
    "stream": false
  }'
{"model":"hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF","created_at":"2026-05-07T11:44:43.462776052Z","message":{"role":"assistant","content":"Hugging Face is a platform for open-source AI research and development."},"done":true}

A Unix/Linux FIFO (First-In, First-Out) pipe, or named pipe, is a special file on the filesystem that allows two or more unrelated processes to communicate with each other.
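
As a minimal illustration (the file path here is arbitrary), a reader blocked on the FIFO receives whatever another process writes into it:

admin@linuxser:~$ mkfifo /tmp/demo.fifo
admin@linuxser:~$ cat /tmp/demo.fifo &            # reader blocks until data arrives
admin@linuxser:~$ echo "hello" > /tmp/demo.fifo   # writer sends data through the pipe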

NOTE: As a permanent solution for remote access to Docker Model Runner, it is recommended to set up a reverse proxy such as httpd or nginx with a proxy pass.
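
A minimal sketch of such a proxy pass with nginx, assuming nginx is installed and that the config path and ports below suit your environment:

admin@linuxser:~$ sudo tee /etc/nginx/conf.d/model-runner.conf > /dev/null <<'EOF'
server {
    listen 12435;                          # port exposed to remote clients
    location / {
        proxy_pass http://127.0.0.1:12434; # forward to Docker Model Runner
    }
}
EOF
admin@linuxser:~$ sudo nginx -t && sudo systemctl reload nginx

On Fedora, SELinux may also require allowing the proxy to make outbound connections: sudo setsebool -P httpd_can_network_connect 1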

Hope you enjoyed reading this article. Thank you.