change the configuration of engineArgs in config.pbtxt #72

Open

activezhao wants to merge 7 commits into base: main

Conversation

activezhao

No description provided.

Comment on lines 33 to 68
# We need to use decoupled transaction policy for saturating
# vLLM engine for max throughput.
# TODO [DLIS:5233]: Allow asynchronous execution to lift this
# restriction for cases where there is exactly a single response to
# a single request.
model_transaction_policy {
  decoupled: True
}

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "sampling_parameters"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
Collaborator

We've recently added autocomplete to vLLM's model.py in this PR: triton-inference-server/vllm_backend#20
Thus, this entry can be eliminated from config.pbtxt.

Author

Hi @oandreeva-nv Yes, I have seen the code of auto_complete_config.

But in my opinion, this may not be the better way to handle the parameters, because users will not know which parameters they should prepare in config.pbtxt unless they read model.py.

This approach does simplify config.pbtxt, but I don't think it is a good interaction for users, especially those who are used to Triton Server. Within the Triton Server ecosystem, this processing logic may leave them confused; at least it feels that way to me.

Collaborator

IIRC, inputs, outputs, and model_transaction_policy do not change between different vLLM models, so it might be helpful to auto-complete these fields. That was the motivation.
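
For reference, a minimal sketch of what such an auto_complete_config hook can look like in a Python backend model.py, using the Triton Python backend model-config API (add_input, add_output, set_model_transaction_policy). This is illustrative only and not the exact code added in vllm_backend#20:

class TritonPythonModel:
    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        # Declare the same inputs/outputs shown in the config.pbtxt snippet above.
        inputs = [
            {"name": "text_input", "data_type": "TYPE_STRING", "dims": [1]},
            {"name": "stream", "data_type": "TYPE_BOOL", "dims": [1], "optional": True},
            {"name": "sampling_parameters", "data_type": "TYPE_STRING", "dims": [1], "optional": True},
        ]
        outputs = [{"name": "text_output", "data_type": "TYPE_STRING", "dims": [-1]}]
        for inp in inputs:
            auto_complete_model_config.add_input(inp)
        for out in outputs:
            auto_complete_model_config.add_output(out)
        # Streaming responses from vLLM require the decoupled transaction policy.
        auto_complete_model_config.set_model_transaction_policy(dict(decoupled=True))
        return auto_complete_model_config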

    if k in model_engine_args:
        model_engine_args[k] = float(model_engine_args[k])

int_keys = ["seed", "pipeline_parallel_size", "tensor_parallel_size", "block_size",
Collaborator

I see what you are doing. My only problem is that the vLLM repository is under active development and these fields may change in the future.

Author

@oandreeva-nv This is a very good question. In fact, it is just like the sampling-parameter processing logic.

We really cannot handle such dynamically changing parameters in an elegant way, but generally speaking, they do not change very often.

We can address this problem by shipping matching versions of Triton Server + vLLM + model.py.

In fact, 23.10-vllm-python-py3 has been released. The combination of Triton Server + vLLM + model.py is a closed loop and runs very well. You can even change the model.py in /opt/tritonserver/backends/vllm of the vllm_backend container if you need to.
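
One way to reduce that drift (a sketch under the assumption that vLLM keeps AsyncEngineArgs as a dataclass; this is not code from this PR): derive the expected types from the dataclass fields instead of maintaining hard-coded float/int key lists in model.py, so newly added engine arguments are picked up automatically. Handling of Optional and other complex annotations is deliberately simplified here.

import dataclasses

from vllm.engine.arg_utils import AsyncEngineArgs


def cast_engine_args(raw_args: dict) -> dict:
    """Cast string values (e.g. from config.pbtxt parameters) to the types
    declared on vLLM's AsyncEngineArgs dataclass."""
    field_types = {f.name: f.type for f in dataclasses.fields(AsyncEngineArgs)}
    casted = {}
    for key, value in raw_args.items():
        target = field_types.get(key)
        # Field annotations may be real types or strings, depending on how
        # vLLM declares them, so check both forms.
        if target in (int, "int"):
            casted[key] = int(value)
        elif target in (float, "float"):
            casted[key] = float(value)
        elif target in (bool, "bool"):
            casted[key] = str(value).lower() in ("true", "1")
        else:
            # Optional[...] and unknown annotations are left as strings.
            casted[key] = value
    return casted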

oandreeva-nv (Collaborator) left a comment

I would like to keep the major part of this tutorial as is, since it shows how to use Triton's vllm_backend. So please don't change the core part of the README; you are welcome to add an extra section, though.

On the other hand, it seems like this work is more of a feature request, i.e. allow vllm_backend to set engineArgs from parameters specified in config.pbtxt and eliminate the dependency on model.json. If this is something you would like added to our vllm_backend, please open an issue on the main Triton issues page.
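
For what it's worth, the parameters-based flow described above would look roughly like this on the model.py side; Triton exposes the config.pbtxt parameters section to the Python backend as a key-to-{"string_value": ...} map inside args["model_config"]. This is a hedged sketch, not the actual vllm_backend or PR implementation:

import json


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is the parsed config.pbtxt serialized as JSON.
        model_config = json.loads(args["model_config"])
        # `parameters` arrives as {key: {"string_value": "..."}}.
        params = model_config.get("parameters", {})
        engine_args = {key: entry["string_value"] for key, entry in params.items()}
        # ... cast value types and build vLLM's AsyncEngineArgs / AsyncLLMEngine here ...
        self.engine_args = engine_args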

]

# The configuration of engineArgs
parameters {
Collaborator

can we do something like:

parameters [
    {
        key: ...
        value: {...}
    },
     {
        key: ...
        value: {...}
    },
    ...
]

Author

@oandreeva-nv Yes, of course we can. But in fact, this style (multiple parameters entries) is actually a relatively common usage. Can we do the same here?

For example:
https://github.com/triton-inference-server/tensorrtllm_backend/blob/47b609b670d6bb33a5ff113d98ad8a44d961c5c6/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt#L155

Collaborator

The final decision is up to you; it just makes more sense to me to keep a list in parameters rather than multiple parameters entries with one parameter each.

@nv-kmcgill53

CLA has been accepted by the Triton Team. This PR can be merged at our discretion.

@activezhao
Author


@nv-kmcgill53 OK, thank you so much.

@oandreeva-nv
Collaborator

Hi @activezhao, thanks for your work! I was thinking about the structure of this tutorial. I would like to keep the original version, which utilizes the vllm_backend repo, under the QuickDeploy section. How about we move this tutorial under a new folder, Customization, with a new README and the suggested model.py and config.pbtxt? Then, in the README, you can describe what new methods you've added to model.py and config.pbtxt. What do you think about this idea?

@activezhao
Author

activezhao commented Nov 28, 2023


Hi @oandreeva-nv Cool, that sounds like a good idea.

And could you please help us build the basic code structure, so that we can then fill things in?

@activezhao
Author

Hi @oandreeva-nv I have created a folder named Customization with a new README and the suggested model.py and config.pbtxt.

Could you please review the code?

Thanks


## Step 1: Prepare your model repository

To use Triton, we need to build a model repository. A sample model repository for deploying `facebook/opt-125m` using vLLM in Triton is
Collaborator

Please clarify that in this tutorial we are using the Python backend + vLLM, so we probably also need to install vLLM in the environment.

Collaborator

Since we are not using the vllm backend, how about we target this tutorial at the general Triton container rather than the vLLM one?

Author

activezhao commented Dec 14, 2023

Hi @oandreeva-nv In fact, I have been using nvcr.io/nvidia/tritonserver:23.10-vllm-python-py3 with the model.py + config.pbtxt that I modified in the production environment, and it is running well.

So I wonder if I could change the README.md into a best practice with details?

What do you think?
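
For context, the model repository for this kind of Python backend + vLLM setup typically follows the standard Triton layout below (the model name vllm_opt is illustrative):

model_repository/
└── vllm_opt/
    ├── 1/
    │   └── model.py        # customized Python backend model
    └── config.pbtxt        # engineArgs supplied via `parameters`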

Collaborator

Sure, feel free to do as you see fit. I can take a look at the final README.

@oandreeva-nv
Collaborator

Hi @activezhao, the pre-commit hook failed unfortunately; could you please fix the formatting?

@activezhao
Author


OK, I will fix them.

@activezhao
Author

Hi @oandreeva-nv I have just updated the README and adjusted the directory structure; could you please help me review the code?
Thanks
