[REQUEST] Support x-grammar structured output framework integration #723

Open
3 tasks done
debasish-mihup opened this issue Jan 27, 2025 · 1 comment

Comments


debasish-mihup commented Jan 27, 2025

Problem

I am using exl2 in my environment to generate structured output (JSON), and it works well. The problem is inference speed: I am unable to pass a multi-instance filter to the dynamic generator, so I cannot run batched inference.

Profiling shows that the GIL is wrecking throughput: CPU usage sits at 100% and is a huge bottleneck. The system cannot make use of additional CPUs, and the lm-format-enforcer framework has not seen much active development recently.

Solution

I found a framework for the same purpose that supports multi-instance inference and also offers a significant speed-up in single-thread performance:

xgrammar framework

Can you look into this or guide me to how to get started to integrate with your inference framework?

Integration Docs

Alternatives

No response

Explanation

The new framework should:

  1. Support batch inference with structured output
  2. Improve single-thread performance (3.5x speed-up on JSON)
  3. Be less memory intensive
  4. Support context-free generation

Paper

Blogpost

Examples

No response

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
@turboderp
Member

I would see if you can't do what you want to do with Formatron. There's an example here. Formatron isn't hindered by the GIL and so you should have more luck with multiple jobs each running their own filter in a separate thread (as in the example).
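
The pattern described above (several jobs, each with its own filter, each in its own thread) can be sketched roughly as follows. This is only an illustration of the ownership pattern using the Python standard library; `ToyBracketFilter`, `run_job` and `run_batch` are hypothetical names and are not part of Formatron or ExLlamaV2. Being pure Python, the toy filter does not itself sidestep the GIL; any real gain would depend on the filter doing its work in native code that releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

VOCAB = ["{", "}", "a"]


class ToyBracketFilter:
    """Hypothetical per-job constraint filter (not a real library class)."""

    def __init__(self):
        # State is private to one job, so jobs never share a filter.
        self.depth = 0

    def allowed(self, vocab):
        # Forbid a closing brace when no brace is currently open.
        return [t for t in vocab if not (t == "}" and self.depth == 0)]

    def feed(self, token):
        # Update the filter state with the token that was accepted.
        if token == "{":
            self.depth += 1
        elif token == "}":
            self.depth -= 1


def run_job(proposed_tokens):
    # Each job builds its own filter instance.
    filt = ToyBracketFilter()
    out = []
    for tok in proposed_tokens:
        if tok in filt.allowed(VOCAB):
            filt.feed(tok)
            out.append(tok)
    return "".join(out)


def run_batch(batch):
    # One thread per job; results come back in input order.
    with ThreadPoolExecutor(max_workers=len(batch)) as pool:
        return list(pool.map(run_job, batch))
```

Calling `run_batch` with two token streams filters each one independently, since no filter state is shared between the threads.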
