The purpose of this repository is to let people use many open-source, instruction-following fine-tuned LLMs as a chatbot service. Because different models behave differently and require differently formatted prompts, I made a very simple library, Ping Pong, for model-agnostic conversation and context management.
Also, I made GradioChat, a UI that has a similar shape to HuggingChat but is built entirely in Gradio. Those two projects are fully integrated to power this project.
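To illustrate why a model-agnostic layer helps, here is a minimal sketch of per-model prompt formatting. It is not the Ping Pong API: the `Turn` class, the two formatter functions, and the template strings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One exchange: what the user said and what the bot answered (hypothetical type)."""
    user: str
    bot: str = ""


def alpaca_style(turn: Turn) -> str:
    # Illustrative Alpaca-like template; real models may differ in detail.
    return f"### Instruction:\n{turn.user}\n\n### Response:\n{turn.bot}"


def vicuna_style(turn: Turn) -> str:
    # Illustrative Vicuna-like template; real models may differ in detail.
    return f"USER: {turn.user}\nASSISTANT: {turn.bot}"


# The same conversation can be rendered for different model families
# by swapping the formatter.
FORMATTERS = {"alpaca": alpaca_style, "vicuna": vicuna_style}


def build_prompt(turns, style):
    """Render the whole conversation with the formatter registered for `style`."""
    fmt = FORMATTERS[style]
    return "\n\n".join(fmt(t) for t in turns)


if __name__ == "__main__":
    turns = [Turn("Hi", "Hello!"), Turn("How are you?")]
    print(build_prompt(turns, "alpaca"))
    print("---")
    print(build_prompt(turns, "vicuna"))
```

The conversation is stored once and only the rendering changes per model, which is the general idea behind keeping conversation management separate from prompt formats.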
This project has become one of the default frameworks at jarvislabs.ai. Jarvislabs.ai is one of the cloud GPU VM providers with the cheapest GPU prices. Furthermore, the weights of all the supported popular open-source LLMs are pre-downloaded, so you don't need to waste money and time waiting to download hundreds of GBs before trying out a collection of LLMs. In less than 10 minutes, you can try out any model.
- For further instructions on how to run the Gradio application, please follow the official documentation on the llmchat framework.
dstack is an open-source tool that lets you run LLM-based apps in a cloud of your choice via a single command. dstack supports AWS, GCP, Azure, Lambda Cloud, etc.
Use the gradio.dstack.yml and discord.dstack.yml configurations to run the Gradio app and Discord bot via dstack.
- For more details on how to run this repo with dstack, read the official documentation by dstack.
- Prerequisites
  Note that the code only works with Python >= 3.9 and gradio >= 3.32.0.
  $ conda create -n llm-serve python=3.9
  $ conda activate llm-serve
- Install dependencies.
  $ cd LLM-As-Chatbot
  $ pip install -r requirements.txt
- Run Gradio application
  There is no required parameter to run the Gradio application. However, there are some small details worth noting.
  - When --local-files-only is set, the application won't try to look up the Hugging Face Hub (remote). Instead, it will only use the files already downloaded and cached.
  - Hugging Face libraries store downloaded contents under ~/.cache by default, and this application assumes so. However, if you downloaded the weights to a different location for some reason, you can set the HF_HOME environment variable. Find more about the environment variables in the Hugging Face documentation.
  - In order to leverage the internet search capability, you need a Serper API Key. You can set it manually in the control panel or in the CLI. When the Serper API Key is specified in the CLI, it is injected into the corresponding UI control. If you don't have one yet, please get it from serper.dev. By signing up, you will get 2,500 free Google searches, which is sufficient for a long-term test.
  $ python app.py --root-path "" \
      --local-files-only \
      --share \
      --debug \
      --serper-api-key "YOUR SERPER API KEY"
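The cache lookup described above can be sketched as follows. This is a simplified illustration, not the application's code: the helper names `resolve_hf_home` and `hub_cache_dir` are hypothetical, and the sketch deliberately ignores other overrides that the Hugging Face libraries honor (such as XDG_CACHE_HOME or HF_HUB_CACHE).

```python
import os
from pathlib import Path


def resolve_hf_home(env=None):
    """Return the Hugging Face home directory: HF_HOME if set,
    otherwise ~/.cache/huggingface (simplified default)."""
    env = os.environ if env is None else env
    default = Path.home() / ".cache" / "huggingface"
    return Path(env.get("HF_HOME", default)).expanduser()


def hub_cache_dir(env=None):
    """Downloaded model weights live under the 'hub' subdirectory of HF home."""
    return resolve_hf_home(env) / "hub"


if __name__ == "__main__":
    # Shows where cached weights would be looked up on this machine.
    print(hub_cache_dir())
```

If your weights were downloaded elsewhere, exporting HF_HOME before launching the app points the lookup at that location, for example `HF_HOME=/data/hf python app.py --local-files-only`.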
- Prerequisites
  Note that the code only works with Python >= 3.9.
  $ conda create -n llm-serve python=3.9
  $ conda activate llm-serve
- Install dependencies.
  $ cd LLM-As-Chatbot
  $ pip install -r requirements.txt
- Run Discord Bot application. Choose one of the modes in --mode-[cpu|mps|8bit|4bit|full-gpu]. full-gpu will be chosen by default (full means half; consider this a typo to be fixed later).
  - The --token is a required parameter, and you can get it from the Discord Developer Portal. If you have not set up a Discord Bot in the Discord Developer Portal yet, please follow the How to Create a Discord Bot Account section of the tutorial from freeCodeCamp to get the token.
  - The --model-name is a required parameter, and you can look through the list of supported models in model_cards.json.
  - --max-workers determines how many requests are handled concurrently. It simply sets the number of workers of the ThreadPoolExecutor.
  - When --local-files-only is set, the application won't try to look up the Hugging Face Hub (remote). Instead, it will only use the files already downloaded and cached.
  - In order to leverage the internet search capability, you need a Serper API Key. If you don't have one yet, please get it from serper.dev. By signing up, you will get 2,500 free Google searches, which is sufficient for a long-term test. Once you have the Serper API Key, you can specify it with the --serper-api-key option.
  - Hugging Face libraries store downloaded contents under ~/.cache by default, and this application assumes so. However, if you downloaded the weights to a different location for some reason, you can set the HF_HOME environment variable. Find more about the environment variables in the Hugging Face documentation.
  $ python discord_app.py --token "DISCORD BOT TOKEN" \
      --model-name "alpaca-lora-7b" \
      --max-workers 1 \
      --mode-[cpu|mps|8bit|4bit|full-gpu] \
      --local_files_only \
      --serper-api-key "YOUR SERPER API KEY"
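To make the effect of --max-workers concrete, here is a minimal sketch of how a worker count bounds concurrent request handling with ThreadPoolExecutor. The functions `fake_generate` and `serve` are hypothetical stand-ins, not code from discord_app.py.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_generate(prompt):
    """Stand-in for a slow model call (hypothetical)."""
    time.sleep(0.05)
    return f"echo: {prompt}"


def serve(prompts, max_workers=1):
    """Handle prompts with at most `max_workers` running at the same time.
    Results come back in the same order as the inputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_generate, prompts))


if __name__ == "__main__":
    # With max_workers=1 the requests run one after another;
    # a larger value lets several run concurrently.
    print(serve(["hello", "world"], max_workers=2))
```

With a single GPU-bound model, a small value such as 1 keeps requests from competing for the same device, at the cost of queueing.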
- Supported Discord Bot commands
  There are no slash commands. The only way to interact with the deployed Discord bot is to mention it. However, you can pass some special strings while mentioning the bot.
  - @bot_name help: displays a simple help message.
  - @bot_name model-info: displays the information of the currently selected (deployed) model from model_cards.json.
  - @bot_name default-params: displays the default parameters used in the model's generate method. That is the GenerationConfig, which holds parameters such as temperature, top_p, and so on.
  - @bot_name user message --max-new-tokens 512 --temperature 0.9 --top-p 0.75 --do_sample --max-windows 5 --internet: all parameters except max-windows dynamically determine the text generation behaviour, as in GenerationConfig. max-windows determines how many past conversations to look up as a reference. The default value is 3, but as the conversation gets longer, you can increase this value. --internet will try to answer your prompt by aggregating information scraped from Google search. To use the --internet option, you need to specify --serper-api-key when booting up the program.
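The last command mixes free text with options, which can be separated roughly as in the sketch below. This is an assumption about how such a message could be parsed, not the bot's actual implementation: `parse_mention` is a hypothetical helper, and using None to mean "fall back to the model's defaults" is an illustrative choice. Only the default of 3 for max-windows comes from the description above.

```python
import argparse


def parse_mention(text):
    """Split a mention (with @bot_name already stripped) into the user
    message and the generation options that follow it."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--max-new-tokens", type=int, default=None)
    parser.add_argument("--temperature", type=float, default=None)
    parser.add_argument("--top-p", type=float, default=None)
    parser.add_argument("--do_sample", action="store_true")
    parser.add_argument("--max-windows", type=int, default=3)
    parser.add_argument("--internet", action="store_true")
    # Words that are not known options are returned separately;
    # they make up the user's message.
    opts, words = parser.parse_known_args(text.split())
    return " ".join(words), opts


if __name__ == "__main__":
    message, opts = parse_mention(
        "tell me a joke --max-new-tokens 512 --temperature 0.9 "
        "--top-p 0.75 --do_sample --max-windows 5 --internet"
    )
    print(message)
    print(vars(opts))
```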
Different models might have different strategies for managing context, so if you want to know the exact strategy applied to each model, take a look at the chats directory. However, here are the basic ideas that I came up with initially. I have found that long prompts eventually slow down the generation process a lot, so I thought the prompts should be kept as short and concise as possible. In the previous version, I accumulated all the past conversations, and that didn't go well.
- In every turn of the conversation, the past N conversations are kept. Think of N as a hyper-parameter. As an experiment, currently only the past 2-3 conversations are kept for all models.
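The windowing idea above can be sketched as follows. This is a minimal illustration, not the code in the chats directory: `build_context` and the "User:"/"Bot:" labels are hypothetical, and real models use their own prompt formats.

```python
def build_context(history, user_message, max_windows=3):
    """Build a prompt from only the last `max_windows` exchanges.

    `history` is a list of (user, bot) pairs, oldest first.
    """
    recent = history[-max_windows:] if max_windows > 0 else []
    lines = []
    for user, bot in recent:
        lines.append(f"User: {user}")
        lines.append(f"Bot: {bot}")
    # The new message goes last, with an open slot for the model's answer.
    lines.append(f"User: {user_message}")
    lines.append("Bot:")
    return "\n".join(lines)


if __name__ == "__main__":
    history = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
    # Only q3/a3 and q4/a4 are kept; older exchanges are dropped.
    print(build_context(history, "q5", max_windows=2))
```

Because the prompt length is bounded by N exchanges rather than by the whole conversation, generation time stays roughly stable as the chat grows.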
Check out the list of models
- tloen/alpaca-lora-7b: the original 7B Alpaca-LoRA checkpoint by tloen (updated on 4/4/2022)
- LLMs/Alpaca-LoRA-7B-elina: the 7B Alpaca-LoRA checkpoint by Chansung (updated on 5/1/2022)
- LLMs/Alpaca-LoRA-13B-elina: the 13B Alpaca-LoRA checkpoint by Chansung (updated on 5/1/2022)
- LLMs/Alpaca-LoRA-30B-elina: the 30B Alpaca-LoRA checkpoint by Chansung (updated on 5/1/2022)
- LLMs/Alpaca-LoRA-65B-elina: the 65B Alpaca-LoRA checkpoint by Chansung (updated on 5/1/2022)
- LLMs/AlpacaGPT4-LoRA-7B-elina: the 7B Alpaca-LoRA checkpoint trained on a GPT4-generated Alpaca style dataset by Chansung (updated on 5/1/2022)
- LLMs/AlpacaGPT4-LoRA-13B-elina: the 13B Alpaca-LoRA checkpoint trained on a GPT4-generated Alpaca style dataset by Chansung (updated on 5/1/2022)
- stabilityai/stablelm-tuned-alpha-7b: StableLM based fine-tuned model
- beomi/KoAlpaca-Polyglot-12.8B: Polyglot based Alpaca style instruction fine-tuned model
- declare-lab/flan-alpaca-xl: Flan XL(3B) based Alpaca style instruction fine-tuned model.
- declare-lab/flan-alpaca-xxl: Flan XXL(11B) based Alpaca style instruction fine-tuned model.
- OpenAssistant/stablelm-7b-sft-v7-epoch-3: StableLM(7B) based OpenAssistant's oasst1 instruction fine-tuned model.
- Writer/camel-5b-hf: Palmyra-base based instruction fine-tuned model. The foundation model and the data are from its creator, Writer.
- lmsys/fastchat-t5-3b-v1.0: T5(3B) based Vicuna style instruction fine-tuned model on ShareGPT by lm-sys
- LLMs/Stable-Vicuna-13B: Stable Vicuna(13B) from CarperAI and Stability AI. This is not a delta weight, so use it at your own risk. I will make this repo private soon and add a Hugging Face token field.
- LLMs/Vicuna-7b-v1.1: Vicuna(7B) from FastChat. This is not a delta weight, so use it at your own risk. I will make this repo private soon and add a Hugging Face token field.
- LLMs/Vicuna-7b-v1.3
- LLMs/Vicuna-13b-v1.1: Vicuna(13B) from FastChat. This is not a delta weight, so use it at your own risk. I will make this repo private soon and add a Hugging Face token field.
- LLMs/Vicuna-13b-v1.3
- LLMs/Vicuna-33b-v1.3
- togethercomputer/RedPajama-INCITE-Chat-7B-v0.1: RedPajama INCITE Chat(7B) from Together.
- mosaicml/mpt-7b-chat: MPT-7B from MosaicML.
- mosaicml/mpt-30b-chat: MPT-30B from MosaicML.
- teknium/llama-deus-7b-v3-lora: LLaMA 7B based Alpaca style instruction fine-tuned model. The only difference from Alpaca is that this model is fine-tuned on more data, including the Alpaca dataset, GPTeacher, General Instruct, Code Instruct, Roleplay Instruct, Roleplay V2 Instruct, GPT4-LLM Uncensored, Unnatural Instructions, WizardLM Uncensored, CamelAI's 20k Biology, 20k Physics, 20k Chemistry, 50k Math GPT4 Datasets, and CodeAlpaca.
- HuggingFaceH4/starchat-alpha: Starcoder 15.5B based instruction fine-tuned model. This model is particularly good at answering questions about coding.
- HuggingFaceH4/starchat-beta: Starcoder 15.5B based instruction fine-tuned model. This model is particularly good at answering questions about coding.
- LLMs/Vicuna-LoRA-EvolInstruct-7B: LLaMA 7B based Vicuna style instruction fine-tuned model. The dataset to fine-tune this model is from WizardLM's Evol Instruction dataset.
- LLMs/Vicuna-LoRA-EvolInstruct-13B: LLaMA 13B based Vicuna style instruction fine-tuned model. The dataset to fine-tune this model is from WizardLM's Evol Instruction dataset.
- project-baize/baize-v2-7b: LLaMA 7B based Baize
- project-baize/baize-v2-13b: LLaMA 13B based Baize
- timdettmers/guanaco-7b: LLaMA 7B based Guanaco, which is fine-tuned on the OASST1 dataset with the QLoRA techniques introduced in the "QLoRA: Efficient Finetuning of Quantized LLMs" paper.
- timdettmers/guanaco-13b: LLaMA 13B based Guanaco, which is fine-tuned on the OASST1 dataset with the QLoRA techniques introduced in the "QLoRA: Efficient Finetuning of Quantized LLMs" paper.
- timdettmers/guanaco-33b-merged: LLaMA 30B based Guanaco, which is fine-tuned on the OASST1 dataset with the QLoRA techniques introduced in the "QLoRA: Efficient Finetuning of Quantized LLMs" paper.
- tiiuae/falcon-7b-instruct: Falcon 7B based instruction fine-tuned model on Baize, GPT4All, GPTeacher, and RefinedWeb-English datasets.
- tiiuae/falcon-40b-instruct: Falcon 40B based instruction fine-tuned model on Baize and RefinedWeb-English datasets.
- LLMs/WizardLM-13B-V1.0
- LLMs/WizardLM-30B-V1.0
- ehartford/Wizard-Vicuna-13B-Uncensored
- ehartford/Wizard-Vicuna-30B-Uncensored
- ehartford/samantha-7b
- ehartford/samantha-13b
- ehartford/samantha-33b
- CalderaAI/30B-Lazarus
- elinas/chronos-13b
- elinas/chronos-33b
- WizardLM/WizardCoder-15B-V1.0
- ehartford/WizardLM-Uncensored-Falcon-7b
- ehartford/WizardLM-Uncensored-Falcon-40b