External Cluster Environments #1244

Open
wants to merge 32 commits into
base: main

@Shrinjay Shrinjay commented Jan 27, 2023

Problem Statement

This pull request addresses issue #1235 and enables multi-cluster operations for Jupyter Enterprise Gateway. To reiterate the problem: Enterprise Gateway currently launches kernels only on the cluster where it is running. This is limiting when we want a kernel to have access to resources on a remote cluster without running Enterprise Gateway on that cluster. Such cases often arise in the interest of security and of isolating internal services from production services. While the k8s client supports connecting to, and launching and managing resources on, a remote cluster, this feature just isn't implemented in the current client.

Feature Description

The changes in this PR implement the ability for users to provide a kubeconfig file for Jupyter Enterprise Gateway to use to launch kernels on a remote cluster that the kubeconfig points to. Specifically, Jupyter Enterprise Gateway will:

  • Create a service account (if enabled)
  • Create a namespaced role for the kernel pod
  • Create a namespaced role binding for the kernel pod
  • Create the kubernetes resource required for the kernel

Operator Instructions:

  1. Ensure your two clusters have interconnected networks. Pods in the two clusters must be able to communicate with each other over pod IP alone.
  2. Provide a kubeconfig file for use in the config/ subdirectory of etc/kubernetes/helm/enterprise-gateway chart.
  3. Set externalCluster.enabled to true.

Implementation Details

When the operator enables external cluster operation and installs/upgrades the helm chart, the following steps occur:

  • A ConfigMap is created for the kubeconfig and mounted into the enterprise gateway pod
  • A ConfigMap is created for the kernel pod role and mounted into the enterprise gateway pod
  • The following env vars are set for multi-cluster operation: EG_USE_REMOTE_CLUSTER, EG_REMOTE_CLUSTER_KUBECONFIG_PATH, EG_DEFAULT_KERNEL_SERVICE_ACCOUNT_NAME

When the enterprise gateway pod starts:

  • An atomic kubernetes client is produced from the Kubernetes Client Factory pointing to the kubeconfig the operator passed in. All further operations use this client.
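
A minimal sketch of how such a factory might pick its configuration source, assuming the `kubernetes` Python package is installed. Only the environment variable names come from this PR; the function names `resolve_client_source` and `build_api_client`, and the set of strings treated as "true", are illustrative assumptions and not the PR's actual code.

```python
import os
from typing import Mapping, Optional, Tuple

# Assumption: which strings count as "enabled" for EG_USE_REMOTE_CLUSTER.
TRUTHY = {"1", "true", "yes", "on"}


def resolve_client_source(env: Mapping[str, str]) -> Tuple[str, Optional[str]]:
    """Decide where the client configuration comes from.

    Returns ("kubeconfig", path) for an external cluster, else ("in-cluster", None).
    """
    use_remote = env.get("EG_USE_REMOTE_CLUSTER", "").strip().lower() in TRUTHY
    if not use_remote:
        return ("in-cluster", None)
    path = env.get("EG_REMOTE_CLUSTER_KUBECONFIG_PATH")
    if not path:
        raise ValueError(
            "EG_USE_REMOTE_CLUSTER is set but EG_REMOTE_CLUSTER_KUBECONFIG_PATH is empty"
        )
    return ("kubeconfig", path)


def build_api_client(env: Mapping[str, str] = os.environ):
    """Build one ApiClient that callers reuse for all further operations."""
    # Imported lazily so the decision logic above can run without the package.
    from kubernetes import client, config

    source, path = resolve_client_source(env)
    if source == "kubeconfig":
        # Returns an ApiClient configured from the given kubeconfig file.
        return config.new_client_from_config(config_file=path)
    # Populates the default configuration from the pod's service account.
    config.load_incluster_config()
    return client.ApiClient()
```

Keeping the decision in a pure function makes it easy to see, and to test, where a client's configuration comes from without touching a cluster.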

When a kernel is launched:

  • If configured, a service account is created in the remote cluster (if one does not exist) for the kernel pod
  • A namespaced role is created in the remote cluster for the kernel pod
  • A namespaced role binding between the role and service account is created for the kernel pod
  • The env variables above are passed to the kernel-launcher
  • The kernel-launcher uses the kubernetes client to launch the kernel in the remote cluster

Assuming network interconnection is set up correctly, the kernel pod should now launch and be able to communicate.
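
The kernel-launch steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the function name `ensure_kernel_rbac`, the `create_service_account` flag, and the `<role>-binding` naming scheme are hypothetical. The `core_api` and `rbac_api` arguments are expected to behave like `kubernetes.client.CoreV1Api` and `kubernetes.client.RbacAuthorizationV1Api` built on the external-cluster client.

```python
from typing import Any, Dict


def ensure_kernel_rbac(
    core_api: Any,
    rbac_api: Any,
    namespace: str,
    service_account: str,
    role_body: Dict[str, Any],
    create_service_account: bool = True,  # hypothetical switch for "if configured"
) -> Dict[str, bool]:
    """Create the service account, role and role binding only if they are missing.

    Returns a dict recording which resources this call created.
    """
    created = {"service_account": False, "role": False, "role_binding": False}
    role_name = role_body["metadata"]["name"]
    binding_name = f"{role_name}-binding"  # hypothetical naming scheme

    if create_service_account:
        accounts = core_api.list_namespaced_service_account(namespace=namespace)
        if service_account not in {item.metadata.name for item in accounts.items}:
            core_api.create_namespaced_service_account(
                namespace=namespace,
                body={
                    "apiVersion": "v1",
                    "kind": "ServiceAccount",
                    "metadata": {"name": service_account},
                },
            )
            created["service_account"] = True

    roles = rbac_api.list_namespaced_role(namespace=namespace)
    if role_name not in {item.metadata.name for item in roles.items}:
        rbac_api.create_namespaced_role(namespace=namespace, body=role_body)
        created["role"] = True

    bindings = rbac_api.list_namespaced_role_binding(namespace=namespace)
    if binding_name not in {item.metadata.name for item in bindings.items}:
        rbac_api.create_namespaced_role_binding(
            namespace=namespace,
            body={
                "apiVersion": "rbac.authorization.k8s.io/v1",
                "kind": "RoleBinding",
                "metadata": {"name": binding_name},
                "subjects": [
                    {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
                ],
                "roleRef": {
                    "apiGroup": "rbac.authorization.k8s.io",
                    "kind": "Role",
                    "name": role_name,
                },
            },
        )
        created["role_binding"] = True

    return created
```

Every call is namespaced, which matches the intent that nothing cluster-scoped is created in the remote cluster. A production version would also need to tolerate a conflict error if two kernels start at the same moment.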

Some notes on the implementation

  • We've done our best to maintain as much of the smooth experience enterprise gateway currently has. However, automating network interconnection between two clusters is an infrastructure problem that's far too broad to be included as an automation here. Therefore, network interconnection is left as the responsibility of the operator.
  • Following best practice, we switch from a statically defined client to an atomic one produced by a factory. With this approach it is much clearer where the configuration for a client comes from.
  • We also strived to make as much of this configurable via Helm as possible. Therefore, the kernel role for the kernel pod is defined in helm, mounted as a configmap and then read by the process proxy, rather than defined in the process proxy itself like the role bindings.

@welcome

welcome bot commented Jan 27, 2023

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@Shrinjay
Author

Seems it was mentioned in another PR that the python interrupt test failures are a red herring, is that correct?

@kevin-bates
Member

Hi @Shrinjay - thank you for providing this pull request. I hope to be able to start reviewing this sometime this afternoon or tomorrow (PST).

> Seems it was mentioned in another PR that the python interrupt test failures are a red herring, is that correct?

That is correct.

@Shrinjay
Author

Hello @kevin-bates ! Happy to hear, looking forward to the review :)

@lresende
Member

Are these merely to access resources on a different cluster in general? Or do we have the intention to enable mapping userA to clusterA and userB to clusterB? And how is that done?

@kevin-bates
Member

Good question @lresende. Reading the (excellent!) description, I believe this is more of a one-time thing: EG is either managing kernel pods within the cluster in which it resides, or EG is managing kernel pods within an external cluster to which it has access but in which it does not reside. If that statement is correct, it might be better to update the title (and feature name) to something like External Cluster Support, or similar, since the current title can imply multiple, and simultaneous, cluster support.

@Shrinjay - thoughts?

@Shrinjay
Author

@kevin-bates I absolutely agree, I didn't think about it like that but calling it External Cluster and switching the helm charts to use the prefix externalCluster makes sense.

@Shrinjay Shrinjay changed the title from "Multi-Cluster Environments" to "External Cluster Environments" on Jan 31, 2023
@Shrinjay
Author

Updated!

Member

@kevin-bates kevin-bates left a comment

Hi @Shrinjay. First, I want to thank you for the pull request and the most excellent write-up you provided to describe this feature. That is much appreciated!

I've provided a few comments (most of which are a nit about camelCasing 🐫 😄) but there are some questions regarding edge cases related to EG_SHARED_NAMESPACE (in which the kernels are meant to run in the same namespace as EG) and BYO namespaces (in which KERNEL_NAMESPACE is provided in the request from the client) among others - mostly minor.

Also, I don't have a way to test this. Have you ensured that spark-based kernels are properly behaved (primarily wrt creation of the executor pods, etc.)?

Thank you!

docs/source/operators/deploy-kubernetes.md (outdated, resolved)
docs/source/operators/deploy-kubernetes.md (outdated, resolved)
docs/source/operators/deploy-kubernetes.md (outdated, resolved)
@@ -0,0 +1,5 @@
"""Instantiates a static global factory and a single atomic client"""
Member

This comment is more about the introduction of the services/external directory. One thing we need to keep in mind is that for EG 4.0, most all of these changes will move to gateway_provisioners. Since it ONLY has the equivalent of services/processproxies and since this doesn't seem like an "external service", I'm inclined to suggest we simply include these files alongside services/processproxies/k8s.py. I don't think they warrant their own location within the package since they are solely used by KubernetesProcessProxy and its subclasses (and the launcher). Would there be an issue with moving these into services/processproxies?

Author

Nope, I think that's fine, I can move them into processproxies and go from there

from ..utils.envutils import is_env_true


class KubernetesClientFactory(SingletonConfigurable):
Member

I love this! Thank you for making this better!

etc/kubernetes/helm/enterprise-gateway/values.yaml (outdated, resolved)
@@ -227,7 +229,6 @@ def _determine_kernel_pod_name(self, **kwargs: dict[str, Any] | None) -> str:
return pod_name

def _determine_kernel_namespace(self, **kwargs: dict[str, Any] | None) -> str:
Member

(The following applies to lines L240-L254 below...)

  1. Today, if the user is bringing their own namespace via KERNEL_NAMESPACE it is their responsibility to ensure proper operation within the namespace. With this PR, and if EG_USE_REMOTE_CLUSTER is True, I'm assuming the same requirement holds. Are there any pieces of new functionality that would preclude users from bringing their own namespace (where that namespace resides in the remote cluster)?

  2. Clearly, both EG_SHARED_NAMESPACE = True and EG_USE_REMOTE_CLUSTER = True conflict. Perhaps we should check for EG_SHARED_NAMESPACE = True when obtaining the k8s client and either log and return the local client or raise an exception?

((Just to note, in gateway_provisioners the default value for EG_SHARED_NAMESPACE is True to allow for easier deployments outside of something like since they are available to any jupyter_client application.))

Author

  1. Something I realize should be documented better is that remote cluster operation requires the namespace to already exist before we launch a kernel there. This was mostly a security concern: as long as operations are isolated within a namespace in a remote cluster it's fine, but creating a namespace is a cluster-level operation, which can be scary for operators to enable.

  2. Agreed on this, we should log a warning then ignore it if EG_USE_REMOTE_CLUSTER is enabled
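
One way the conflict check discussed in point 2 might look, as a sketch only. The local `is_env_true` is a stand-in with assumed semantics for the PR's helper of the same name, and `use_external_cluster` with its `strict` flag is a hypothetical name. The sketch shows both options raised in the review: log and fall back to the local cluster, or raise an exception.

```python
import logging
import os
from typing import Mapping

log = logging.getLogger("external_cluster_sketch")


def is_env_true(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Stand-in for the PR's is_env_true helper (assumed semantics)."""
    return env.get(name, "").strip().lower() == "true"


def use_external_cluster(env: Mapping[str, str] = os.environ, strict: bool = False) -> bool:
    """Return True if kernels should be launched on the external cluster."""
    shared = is_env_true("EG_SHARED_NAMESPACE", env)
    remote = is_env_true("EG_USE_REMOTE_CLUSTER", env)
    if shared and remote:
        msg = "Cannot use EG_USE_REMOTE_CLUSTER and EG_SHARED_NAMESPACE at the same time."
        if strict:
            # Option 2 from the review: refuse to continue.
            raise ValueError(msg)
        # Option 1 from the review: log and fall back to the local client.
        log.warning("%s Using local cluster.", msg)
        return False
    return remote
```

Raising is the stricter choice and surfaces a misconfiguration at startup; logging keeps existing shared-namespace deployments working if the external-cluster flag is set by mistake.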

@kevin-bates kevin-bates added the "waiting for author" label on Feb 7, 2023
@kevin-bates
Member

Hi @Shrinjay - gentle ping - could you please address the review comments?

@Shrinjay
Author

@kevin-bates My deepest apologies, I was working on this as part of an initiative at work, but I got pulled off to work on something else and this completely slipped my mind. I'm back on this now, and my contract is coming to an end, so I'll be addressing these comments and looking to get this PR done within the upcoming weeks.

@Shrinjay
Author

@kevin-bates Made some updates based on your suggestions. However, your comment on testing got me thinking, and I can't figure out a great way to test this in a repeatable way. Currently we just test this manually; the only success criterion is that a kernel is able to launch, as everything after that is the same as if it were running on a local cluster. I can't really tell if the processproxies and such are tested in itests as well. If they are, then we could replicate the processproxies test harness for this. Any guidance?

@kevin-bates
Member

Hi @Shrinjay - thanks for the update.

Regarding testing, we have no tests for Kubernetes process proxies. The integration test we have uses the enterprise-gateway-demo image to test process proxies YarnClusterProcessProxy and DistributedProcessProxy (in loopback mode). I'm building a mock layer in Gateway Provisioners to try to improve this, but, for now, we can keep things as is and rely on manual testing.

My biggest areas of concern are the behaviors around BYO Namespace (where the client pre-creates the namespace and references that name in KERNEL_NAMESPACE in the start kernel request) and Shared Namespace (where the kernels reside in the EG namespace). The latter doesn't really apply since namespaces can't span clusters, but it would be good to understand the behavior, especially since this is the default in Gateway Provisioners. However, if the other cluster happens to also define the same-named namespace, what happens? I think we should probably raise an exception in this case.

I will try to spend some time reviewing this PR in the next few days.

Member

@kevin-bates kevin-bates left a comment

Hi @Shrinjay - I didn't have a chance to verify existing behavior, but plan to do so before this is merged. I had some relatively minor comments. Thanks again for contributing this!


from ..kernels.remotemanager import RemoteKernelManager
from ..sessions.kernelsessionmanager import KernelSessionManager
from ..utils.envutils import is_env_true
Member

Could we refactor this filename to include an underscore for word separation?

Suggested change
- from ..utils.envutils import is_env_true
+ from ..utils.env_utils import is_env_true

@@ -239,7 +248,7 @@ def _determine_kernel_namespace(self, **kwargs: dict[str, Any] | None) -> str:

# If KERNEL_NAMESPACE was provided, then we assume it already exists. If not provided, then we'll
# create the namespace and record that we'll want to delete it as well.
namespace = kwargs["env"].get("KERNEL_NAMESPACE")
namespace = os.environ.get("KERNEL_NAMESPACE")
Member

Prior to launch, KERNEL_NAMESPACE must be pulled from the start request arguments (kwargs["env"]) rather than the EG process env, so this change needs to be reverted.

Suggested change
- namespace = os.environ.get("KERNEL_NAMESPACE")
+ namespace = kwargs["env"].get("KERNEL_NAMESPACE")
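
To illustrate why the lookup has to use the request arguments, here is a small sketch. The function name `requested_namespace` is hypothetical; only `kwargs["env"]` and `KERNEL_NAMESPACE` come from the code under review.

```python
from typing import Any, Dict, Optional


def requested_namespace(**kwargs: Any) -> Optional[str]:
    """Return the namespace named in the start-kernel request, if any."""
    request_env: Dict[str, str] = kwargs.get("env") or {}
    return request_env.get("KERNEL_NAMESPACE")


# Two requests handled by the same gateway process can name different
# namespaces, which a lookup in the process-wide os.environ cannot express.
print(requested_namespace(env={"KERNEL_NAMESPACE": "team-a"}))  # team-a
print(requested_namespace(env={"KERNEL_NAMESPACE": "team-b"}))  # team-b
print(requested_namespace(env={}))  # None
```

`os.environ` describes the EG process itself and is the same for every request, so reading `KERNEL_NAMESPACE` from it would silently ignore a namespace the client brought along.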

self.log.warning(f"Deleted kernel namespace: {namespace}")
else:
reason = f"Error occurred creating namespace '{namespace}': {err}"
self.log_and_raise(http_status_code=500, reason=reason)

return namespace

def _create_service_account_if_not_exists(
Member

If this is strictly for external clusters, could we rename this to something like: _create_external_service_account_if_not_exists or _create_remote_service_account_if_not_exists. Since the other configurable options refer to "external", I guess I'd prefer the former.

)

self.log.info(
f"Created service account {service_account_name} in namespace {namespace}"
Member

Is there any way to access the "name" of the external cluster? Seems really helpful to include that here.

Suggested change
f"Created service account {service_account_name} in namespace {namespace}"
f"Created service account {service_account_name} in namespace {namespace} of external cluster {external_cluster}"
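
A possible way to obtain a cluster name for the log message, sketched under the assumption that the kubeconfig's active context is the one in use. `kubernetes.config.list_kube_config_contexts` is part of the official Python client; the helper names `cluster_name_from_context` and `external_cluster_name` are hypothetical.

```python
from typing import Any, Mapping, Optional


def cluster_name_from_context(active_context: Optional[Mapping[str, Any]]) -> str:
    """Extract the cluster name from a kubeconfig context entry, or 'unknown'."""
    if not active_context:
        return "unknown"
    return active_context.get("context", {}).get("cluster") or "unknown"


def external_cluster_name(kubeconfig_path: str) -> str:
    """Look up the cluster referenced by the kubeconfig's active context."""
    # Imported lazily so the pure helper above works without the package.
    from kubernetes import config

    _contexts, active = config.list_kube_config_contexts(config_file=kubeconfig_path)
    return cluster_name_from_context(active)
```

The name returned is whatever the kubeconfig calls the cluster, which is a local label chosen by the operator rather than an identifier the cluster itself reports.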

f"Created service account {service_account_name} in namespace {namespace}"
)

def _create_role_if_not_exists(self, namespace: str) -> None:
Member

A similar name change to that suggested previously would be nice here.


def _create_role_if_not_exists(self, namespace: str) -> None:
"""If role doesn't exist in target cluster, create one. Occurs if a remote cluster is being used"""
role_yaml_path = os.getenv('EG_REMOTE_CLUSTER_ROLE_PATH')
Member

Should role_yaml_path be validated for None and the file's existence?

This line (and any validation) should be moved within the if kernel_cluster_role not in remote_cluster_role_names: block at L379 so that it's only performed if needed.
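
A sketch of the validation being asked for, deferred until a role actually has to be created. Only `EG_REMOTE_CLUSTER_ROLE_PATH` comes from the PR; `resolve_role_yaml_path`, `ensure_role`, and the `create_role` callback are hypothetical names used for illustration.

```python
import os
from typing import Callable, Iterable, Mapping


def resolve_role_yaml_path(env: Mapping[str, str] = os.environ) -> str:
    """Return the role definition path, failing early with a clear message."""
    path = env.get("EG_REMOTE_CLUSTER_ROLE_PATH")
    if not path:
        raise ValueError("EG_REMOTE_CLUSTER_ROLE_PATH is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Role definition file not found: {path}")
    return path


def ensure_role(
    existing_role_names: Iterable[str],
    role_name: str,
    create_role: Callable[[str], None],  # hypothetical: creates the role from a YAML path
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Create the role only if missing; the path is validated only in that case."""
    if role_name in existing_role_names:
        return False
    create_role(resolve_role_yaml_path(env))
    return True
```

With this ordering, a missing or misconfigured path does not affect launches where the role already exists in the target namespace.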

else:
if is_env_true('EG_USE_REMOTE_CLUSTER'):
self.log.warning(
"Cannot use EG_USE_REMOTE_CLUSTER and EG_SHARED_NAMESPACE at the same time. Using local cluster...."
Member

👍

Labels
enhancement, gateway-provisioners (has application for gateway-provisioners, EG 4.0), kubernetes, performance & scalability, waiting for author (waiting for information from item's author)
3 participants