Weird network issues when running inside Singularity container #3531

Closed
DeepHorizons opened this issue May 13, 2019 · 12 comments

@DeepHorizons

Version of Singularity:

3.1.0

Expected behavior

We run our application and it works without issues or errors.

Actual behavior

When we run our application inside a Singularity image, about 50% of the time we get an error from one of the libraries we use. The other 50% of the time it works fine.

When running outside of Singularity there are no issues.

Steps to reproduce behavior

We have not yet been able to reproduce it reliably; it seems to be random.

I originally thought this was an issue with the library we were using, but after testing our application outside of Singularity and having it work 100% of the time, I'm looking into whether Singularity is doing something weird. There is some additional information on the library side here: Azure/azure-storage-cpp#259

What is Singularity's role with containers and the network? Is there a way I can output network information?

@cclerget
Collaborator

Hi @DeepHorizons, indeed this is weird. How did you run the image?
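
For the question about outputting network information: assuming the image has the usual tools installed, you can dump what the container actually sees, for example:

singularity exec <image>.sif cat /etc/resolv.conf
singularity exec <image>.sif cat /etc/hosts
singularity exec <image>.sif ip addr

(ip addr only works if iproute2 is present in the image; this is just a sketch, not an exhaustive check.)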

@cclerget self-assigned this May 13, 2019
@DeepHorizons
Author

singularity run -B /etc/localtime -B /usr/share/zoneinfo/ <image>.sif python3 <script>

The image header is:

Bootstrap: docker
From: ubuntu:16.04
...

It also doesn't have a runscript, which I'm assuming makes run and exec the same.

@cclerget
Collaborator

run and exec scripts are a bit different, but I don't think the issue lies there. As a first guess: since you are using Python, it may be picking up modules from your $HOME directory, and mixing those with the ones in the container could produce weird behaviour. You can try adding the --contain or --no-home option to tell Singularity not to mount $HOME (or to mount an empty one) inside the container and see how it goes.
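
For example, something along these lines (the bind paths are just taken from your earlier command, adjust as needed):

singularity run --no-home -B /etc/localtime -B /usr/share/zoneinfo/ <image>.sif python3 <script>

or with --contain instead of --no-home if you also want /tmp and other host directories isolated.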

@DeepHorizons
Author

We depend on some folders in the home directory, but I tried my best to run it in a contained environment, as such:

singularity run -B /etc/localtime -B /usr/share/zoneinfo/ -B /home/user/<folder> -B /dev --no-home --contain <image>.sif python3 <script>

But the problem still persists. Something I'm noticing is that if I run the command manually it seems to work, whereas having the command run at boot does not.
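
To compare the boot case with the manual case, one option (just a sketch; the log path and wrapper are hypothetical) is to have the boot-time service capture the host's network state right before singularity starts:

# hypothetical wrapper invoked by the boot-time service
ip addr > /tmp/boot-net-state.log 2>&1
cat /etc/resolv.conf >> /tmp/boot-net-state.log
singularity run -B /etc/localtime -B /usr/share/zoneinfo/ -B /home/user/<folder> -B /dev --no-home --contain <image>.sif python3 <script>

Comparing that log against the same commands run manually later should show whether the network or resolver config looks different at boot.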

@GodloveD
Collaborator

Are you running this on a cluster? Are there certain nodes where it runs and certain nodes where it fails? Does it run reliably as a single job and perhaps only fail when run in parallel?

@DeepHorizons
Author

It has all been run on one machine, and only one instance is running at a time.

@jmstover
Contributor

jmstover commented May 14, 2019

Looking at the Azure issue, is DNS resolution actually succeeding and the failure happening on the endpoint connection? It says you could manually run dig to query the name... but what about when the application is running by itself and fails?

Would it be possible for you to try bypassing the DNS lookup by adding an entry to /etc/hosts?
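
For example (the IP and hostname below are placeholders, not the real endpoint), an entry like

203.0.113.10 <storage-account>.blob.core.windows.net

in /etc/hosts on the host, bound into the container with -B /etc/hosts, would skip the resolver entirely for that name and help confirm whether the DNS lookup is the failing step.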

@DeepHorizons
Author

I got around to doing more testing and found some interesting behaviors:

  • Modifying /etc/hosts on the host or mounting it into the container has no impact.

Since I was seeing issues with it only on boot, I thought maybe it was related to the boot order, so I decided to try sleeps in different places.

  • Adding a sleep 30 before the execution of singularity will cause it to work. A sleep inside the singularity container has no impact. Restarting the process inside the original container started at boot has no impact. Killing the container and starting a new one will cause it to work.

So it seems there is some system state that Singularity is "remembering" but that gets fixed/settled a little after boot. Perhaps adjusting the boot order would fix this?

  • Causing singularity to be the last thing systemd starts has no impact.

I tried starting it after network-online.target and graphical.target, and also writing my own target that is the last thing called. The same issue persists, so something is happening after systemd has finished booting the system. Looking at journalctl, I see a "... enp1s0: link down" message a little bit after my service starts and before singularity starts running, but the link comes back up a few seconds later, after the system has reported that it finished booting.
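
For reference, the usual way to order a unit after the network in systemd is to pair Wants= and After= on network-online.target; a minimal sketch (the script path is hypothetical):

[Unit]
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/run-singularity.sh

Even then, network-online.target only reflects what the network manager considers "online", so a link that goes down and comes back afterwards (like the enp1s0 message above) can still race with the service.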

@cclerget
Collaborator

@DeepHorizons Looks like a side effect of how systemd starts services: even if singularity is started last, that doesn't necessarily mean the other services have fully started, and you said a sleep 30 makes it work. It may be that /etc/resolv.conf hasn't yet been generated by systemd-resolved when singularity starts. Try running singularity with --dns X.X.X.X added, using your DNS setup.
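
For example, something like this (8.8.8.8 is just a placeholder resolver; substitute whatever your site actually uses):

singularity run --dns 8.8.8.8 -B /etc/localtime -B /usr/share/zoneinfo/ <image>.sif python3 <script>

The flag adds the given server(s) to the container's resolv.conf, so the container no longer depends on the host's /etc/resolv.conf being ready at boot.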

@jscook2345
Contributor

@DeepHorizons

Any feedback on @cclerget's suggestion to try using the --dns flag?

@cclerget
Collaborator

Closing, feel free to re-open it if required.

@DeepHorizons
Author

For the sake of completeness: the --dns option did fix the issue. The problem eventually came back even with the other workaround, but once I added the --dns flag it worked 100% of the time.
