StableDiffusion w/ RayServe on inf2 broken #604

Open
askulkarni2 opened this issue Jul 31, 2024 · 0 comments

@askulkarni2 (Collaborator)

Description

When deploying the Stable Diffusion XL Base model on Inferentia (inf2) with Ray Serve, the client receives an HTTP 504. The Ray dashboard shows a health check failure for the replicas. The dead actor logs show a kernel crash dump preceded by a memory allocation error, as shown in the attached crash dump.

dump.txt

Steps to reproduce the behavior:

Deploy https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/StableDiffusion-inf2

Expected behavior

Works as documented.

Actual behavior

Ray replicas crash with a health check failure.

Additional context

This seems to be an issue with the amazon-eks-gpu-node-1.29-v20240729 AMI; the deployment was working with amazon-eks-gpu-node-1.29-v20240703. Because we don't pin the AMI in Karpenter, it picked up the new AMI and we started seeing the error. We should go through all our blueprints and pin everything.
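
As a rough illustration of what an audit of the blueprints could check, here is a minimal Python sketch. It assumes each blueprint's Karpenter EC2NodeClass `spec` has already been parsed into a dict, and that AMIs are selected through `amiSelectorTerms` entries with `id` or `name` keys, as I understand the Karpenter API; verify the field names against the Karpenter version in use. The helper name `is_ami_pinned` is hypothetical.

```python
from typing import Any


def is_ami_pinned(node_class_spec: dict[str, Any]) -> bool:
    """Return True when every AMI selector term names one specific AMI.

    Assumption: the spec follows Karpenter's EC2NodeClass layout, with
    AMI selection under the "amiSelectorTerms" key.
    """
    terms = node_class_spec.get("amiSelectorTerms") or []
    if not terms:
        # No selector terms at all, so nothing constrains the AMI choice.
        return False
    for term in terms:
        ami_id = term.get("id", "")
        name = term.get("name", "")
        has_exact_id = ami_id.startswith("ami-")
        # A wildcard in the name can match newer AMI releases.
        has_exact_name = bool(name) and "*" not in name
        if not (has_exact_id or has_exact_name):
            return False
    return True


# Hypothetical specs for illustration only.
unpinned = {"amiFamily": "AL2"}
wildcard = {"amiSelectorTerms": [{"name": "amazon-eks-gpu-node-1.29-*"}]}
pinned = {"amiSelectorTerms": [{"name": "amazon-eks-gpu-node-1.29-v20240703"}]}

for label, spec in [("unpinned", unpinned), ("wildcard", wildcard), ("pinned", pinned)]:
    print(label, is_ami_pinned(spec))
```

Pinning by exact name, as in the last spec, would have kept nodes on the v20240703 AMI until someone deliberately bumped it.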

askulkarni2 added the bug (Something isn't working) label Jul 31, 2024