How to run your own ChatGPT
Concerned about sharing your data on public AI services? Have compliance requirements that restrict their use within your organisation? Discover how to run your own LLM.
Concerns about public AI services
Using public AI services raises several concerns that users and organisations need to be aware of in order to maintain security, privacy, and compliance, including:
Data privacy
Security risks
Compliance and legal issues
Intellectual property
Cost and vendor lock-in
There are certainly other concerns, but these highlight some of the key challenges. Running your own Large Language Model (LLM) can help you overcome many of them.
Considerations
There are several ways to run your own LLM-powered AI service. You can run the model on the CPU or, for much better performance, on a GPU. I did test Distributed Llama, which spreads inference across multiple servers to improve response times, but it didn’t meet my expectations. In this article, I will therefore focus on setting up a single server and using GPU processing.
Setup
Step 1: Prepare the server
You will need a GPU server with SSH access to it. I will use an AWS GPU instance, but you might consider another provider that is a bit cheaper.
If you stick with AWS, a few of the cheapest GPU instances would be:
g3s.xlarge - NVIDIA Tesla M60 (~$0.3591/h Spot price)
g4dn.xlarge - NVIDIA T4 Tensor Core (~$0.1919/h Spot price)
p2.xlarge - NVIDIA Tesla K80 (~$0.3355/h Spot price)
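Spot prices vary by region and change over time, so it is worth checking the current figures yourself. A quick way to do that, assuming you have the AWS CLI configured, is a query like the one below (eu-west-1 is just an example region):
# List current Spot prices for the instance types above
aws ec2 describe-spot-price-history \
  --region eu-west-1 \
  --instance-types g3s.xlarge g4dn.xlarge p2.xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[].{Type:InstanceType,AZ:AvailabilityZone,Price:SpotPrice}' \
  --output table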
Choose at least 40GB of storage, depending on the LLM you will be using.
Be sure to choose an AMI that has the NVIDIA drivers preinstalled; otherwise you will need to install them yourself. Instructions for installing the drivers manually are here.
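For reference, launching such an instance from the AWS CLI could look roughly like the sketch below. The key pair, security group and subnet IDs are placeholders you would replace with your own; the AMI ID is the driver-preinstalled image I mention at the end of this article.
# Rough sketch: launch a g4dn.xlarge Spot instance with 40 GB of gp3 storage
# my-key, sg-0123456789abcdef0 and subnet-0123456789abcdef0 are placeholders
aws ec2 run-instances \
  --region eu-west-1 \
  --image-id ami-0faa087f5b4d78dc7 \
  --instance-type g4dn.xlarge \
  --key-name my-key \
  --security-group-ids sg-0123456789abcdef0 \
  --subnet-id subnet-0123456789abcdef0 \
  --instance-market-options 'MarketType=spot' \
  --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=40,VolumeType=gp3}'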
Step 2: Set up Llama 3
Once you have your instance up and running, let’s install Ollama, which we will use to run Llama 3, and test how it works.
curl -fsSL https://ollama.com/install.sh | sh
The result should look like this:
[root@llama ec2-user]# curl -fsSL https://ollama.com/install.sh | sh
>>> Downloading ollama...
#######################################################################
>>> Installing ollama to /usr/bin...
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink from /etc/systemd/system/default.target.wants/ollama.service to /etc/systemd/system/ollama.service.
>>> NVIDIA GPU installed.
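Before moving on, a couple of quick sanity checks are useful: confirm that the driver can see the GPU and that the Ollama service is actually running. The exact output will depend on your instance.
# Confirm the GPU is visible and the Ollama service is up
nvidia-smi
systemctl status ollama --no-pager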
Step 3: Test Llama
ollama run llama3
My test looked like this:
In the test above, Llama 3 answered really fast, which would not have been the case had we been using the CPU instead of the GPU.
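Besides the interactive CLI, Ollama also listens on a local HTTP API (port 11434 by default), which is handy for a quick scripted check. The prompt below is just an example:
# Query the local Ollama API directly (default port 11434)
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'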
Step 4: Set up OpenWebUI
Let’s run OpenWebUI in Docker rather than the pip version, since the latter is in beta.
Prepare the prerequisites:
yum -y install docker docker-runtime-nvidia libnvidia-container-tools
systemctl enable docker
systemctl start docker
Run the OpenWebUI Docker container:
docker run -d -p 8080:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
Verify that the Docker container is up and healthy:
[root@llama ~]# docker ps -a
CONTAINER ID   IMAGE                                  COMMAND           CREATED              STATUS                    PORTS                                       NAMES
d9a8c31d4ad2   ghcr.io/open-webui/open-webui:ollama   "bash start.sh"   About a minute ago   Up 56 seconds (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp   open-webui
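If the container does not report healthy, the logs are the first place to look, and you can also poke the exposed port directly. These are generic Docker checks rather than anything OpenWebUI-specific:
# Inspect the container logs and confirm something answers on port 8080
docker logs --tail 50 open-webui
curl -I http://localhost:8080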
Step 5: Test OpenWebUI
OpenWebUI exposes port 8080, as we can see above. Let’s connect to it via a web browser and register:
Let’s set up the model now. In the top right, click your profile icon and go to Admin panel → Settings → Models, then enter one of the models. I chose llama3.1:8b. To see all other available models, click here.
Once the download process finishes, you can open a new chat, choose your model at the top left, and ask your question.
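If you prefer the command line over the admin panel, the same download can be triggered from inside the container, assuming you used the bundled-Ollama image from the docker run command above:
# Alternative: pull the model from inside the OpenWebUI container
docker exec -it open-webui ollama pull llama3.1:8b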
Be sure your NVIDIA drivers are already working, or you will have a bad time. I chose to go with a g4dn.xlarge instance and preinstalled drivers, so the AMI for this at the time of writing, in the eu-west-1 region, was ami-0faa087f5b4d78dc7 - Amazon Linux 2 AMI with NVIDIA TESLA GPU Driver.
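If you want to double-check that this AMI is still available in your region (AMI IDs differ per region and can be deprecated over time), you can query it with the AWS CLI:
# Confirm the AMI is available in eu-west-1 and show its full name
aws ec2 describe-images \
  --region eu-west-1 \
  --image-ids ami-0faa087f5b4d78dc7 \
  --query 'Images[].{Id:ImageId,Name:Name}' \
  --output table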