External Cluster Environments #1244
base: main
Conversation
Thanks for submitting your first pull request! You are awesome! 🤗
for more information, see https://pre-commit.ci
…enterprise_gateway into feature/remote-cluster
for more information, see https://pre-commit.ci
Seems it was mentioned in another PR that the python interrupt test failures are a red herring, is that correct?
Hi @Shrinjay - thank you for providing this pull request. I hope to be able to start reviewing this sometime this afternoon or tomorrow (PST).
That is correct.
Hello @kevin-bates! Happy to hear, looking forward to the review :)
Are these merely to access resources on a different cluster in general? Or do we have the intention to enable mapping userA to clusterA and userB to clusterB? And how is that done?
Good question @lresende. Reading the (excellent!) description, I believe this is more of a one-time thing - EG is either managing kernel pods within the cluster in which it resides, or EG is managing kernel pods within an external cluster to which it has access but in which it does not reside. If that statement is correct, it might be better to update the title (and feature name) to something like External Cluster Support, or similar, since the current title can imply multiple, and simultaneous, cluster support. @Shrinjay - thoughts?
@kevin-bates I absolutely agree, I didn't think about it like that, but calling it External Cluster and switching the helm charts to use the prefix `externalCluster` works.
Updated!
for more information, see https://pre-commit.ci
…enterprise_gateway into feature/remote-cluster
Hi @Shrinjay. First I want to thank you for the pull request and the most excellent write-up you provided to describe this feature. That is much appreciated!
I've provided a few comments (most of which are nits about camelCasing 🐫 😄), but there are some questions regarding edge cases related to `EG_SHARED_NAMESPACE` (in which the kernels are meant to run in the same namespace as EG) and BYO namespaces (in which `KERNEL_NAMESPACE` is provided in the request from the client), among others - mostly minor.
Also, I don't have a way to test this. Have you ensured that spark-based kernels are properly behaved (primarily wrt creation of the executor pods, etc.)?
Thank you!
@@ -0,0 +1,5 @@
"""Instantiates a static global factory and a single atomic client"""
This comment is more about the introduction of the `services/external` directory. One thing we need to keep in mind is that, for EG 4.0, most of these changes will move to `gateway_provisioners`. Since it ONLY has the equivalent of `services/processproxies`, and since this doesn't seem like an "external service", I'm inclined to suggest we simply include these files alongside `services/processproxies/k8s.py`. I don't think they warrant their own location within the package since they are solely used by `KubernetesProcessProxy` and its subclasses (and the launcher). Would there be an issue with moving these into `services/processproxies`?
Nope, I think that's fine, I can move them into processproxies and go from there
from ..utils.envutils import is_env_true


class KubernetesClientFactory(SingletonConfigurable):
I love this! Thank you for making this better!
etc/kubernetes/helm/enterprise-gateway/templates/deployment.yaml (Outdated; resolved)
etc/kubernetes/helm/enterprise-gateway/templates/kernel-role.yaml (Outdated; resolved)
etc/kubernetes/helm/enterprise-gateway/templates/role-configmap.yaml (Outdated; resolved)
@@ -227,7 +229,6 @@ def _determine_kernel_pod_name(self, **kwargs: dict[str, Any] | None) -> str:
        return pod_name

    def _determine_kernel_namespace(self, **kwargs: dict[str, Any] | None) -> str:
(The following applies to lines L240-L254 below...)

- Today, if the user is bringing their own namespace via `KERNEL_NAMESPACE`, it is their responsibility to ensure proper operation within that namespace. With this PR, and if `EG_USE_REMOTE_CLUSTER` is True, I'm assuming the same requirement holds. Are there any pieces of new functionality that would preclude users from bringing their own namespace (where that namespace resides in the remote cluster)?
- Clearly, `EG_SHARED_NAMESPACE` = True and `EG_USE_REMOTE_CLUSTER` = True conflict. Perhaps we should check for `EG_SHARED_NAMESPACE` = True when obtaining the k8s client and either log and return the local client or raise an exception?

((Just to note, in `gateway_provisioners` the default value for `EG_SHARED_NAMESPACE` is True to allow for easier deployments outside of something like EG, since the provisioners are available to any `jupyter_client` application.))
- Something I realize should be documented better is that remote-cluster operation requires the namespace to already exist before we launch a kernel there. This was mostly a security concern: as long as operations are isolated within a namespace in the remote cluster it's fine, but creating a namespace is a cluster-level operation, which can be scary for operators to enable.
- Agreed on this; we should log a warning and then ignore it if `EG_USE_REMOTE_CLUSTER` is enabled.
Hi @Shrinjay - gentle ping - could you please address the review comments?
@kevin-bates My deepest apologies, I was working on this as part of an initiative at work, but I got pulled off to work on something else and this completely slipped my mind. I'm back on this now, and my contract is coming to an end, so I'll be addressing these comments and looking to get this PR done within the upcoming weeks.
Co-authored-by: Kevin Bates <[email protected]>
Co-authored-by: Kevin Bates <[email protected]>
Co-authored-by: Kevin Bates <[email protected]>
Co-authored-by: Kevin Bates <[email protected]>
Co-authored-by: Kevin Bates <[email protected]>
Co-authored-by: Kevin Bates <[email protected]>
for more information, see https://pre-commit.ci
@kevin-bates Made some updates based on your suggestions; however, your comment on testing got me thinking, and I can't figure out a great way to test this repeatably. Currently we just test this manually; the only success criterion is that a kernel is able to launch, since everything after that is the same as if it were running on a local cluster. I can't really tell whether the processproxies and such are tested in the itests as well. If they are, then we could replicate the processproxies test harness for this. Any guidance?
Hi @Shrinjay - thanks for the update. Regarding testing, we have no tests for Kubernetes process proxies. The integration test we have uses the …

My biggest areas of concern are the behaviors around BYO Namespace (where the client pre-creates the namespace and references that name in `KERNEL_NAMESPACE`).

I will try to spend some time reviewing this PR in the next few days.
Hi @Shrinjay - I didn't have a chance to verify existing behavior, but plan to do so before its merge. I had some relatively minor comments. Thanks again for contributing this!
from ..kernels.remotemanager import RemoteKernelManager
from ..sessions.kernelsessionmanager import KernelSessionManager
from ..utils.envutils import is_env_true
Could we refactor this filename to include an underscore for word separation?
- from ..utils.envutils import is_env_true
+ from ..utils.env_utils import is_env_true
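For context, `is_env_true` itself isn't shown in this excerpt. A minimal sketch of what such a truthy-environment-variable helper might look like, purely as an assumption (the actual implementation in enterprise_gateway may differ):

```python
import os


def is_env_true(name: str) -> bool:
    """Return True when the named environment variable holds a truthy string.

    Sketch only - the real helper may accept different truthy spellings or a default.
    """
    return os.getenv(name, "false").strip().lower() in ("true", "1", "yes")
```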
@@ -239,7 +248,7 @@ def _determine_kernel_namespace(self, **kwargs: dict[str, Any] | None) -> str:

        # If KERNEL_NAMESPACE was provided, then we assume it already exists. If not provided, then we'll
        # create the namespace and record that we'll want to delete it as well.
-       namespace = kwargs["env"].get("KERNEL_NAMESPACE")
+       namespace = os.environ.get("KERNEL_NAMESPACE")
Prior to launch, `KERNEL_NAMESPACE` must be pulled from the start request arguments (`kwargs["env"]`) rather than the EG process env, so this change needs to be reverted.
- namespace = os.environ.get("KERNEL_NAMESPACE")
+ namespace = kwargs["env"].get("KERNEL_NAMESPACE")
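To illustrate why the distinction matters (the values below are hypothetical): `kwargs["env"]` carries the per-kernel environment sent with each start request, while `os.environ` is the gateway's own process environment shared by every kernel it launches.

```python
import os

# Hypothetical start-request kwargs, roughly as a process proxy would receive them.
kwargs = {"env": {"KERNEL_NAMESPACE": "team-a-kernels", "KERNEL_USERNAME": "alice"}}

# Per-kernel value taken from the start request (the behavior being restored above):
namespace_from_request = kwargs["env"].get("KERNEL_NAMESPACE")   # "team-a-kernels"

# Gateway process environment - one value shared across all kernels, usually unset:
namespace_from_process = os.environ.get("KERNEL_NAMESPACE")      # typically None
```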
self.log.warning(f"Deleted kernel namespace: {namespace}") | ||
else: | ||
reason = f"Error occurred creating namespace '{namespace}': {err}" | ||
self.log_and_raise(http_status_code=500, reason=reason) | ||
|
||
return namespace | ||
|
||
def _create_service_account_if_not_exists( |
If this is strictly for external clusters, could we rename this to something like `_create_external_service_account_if_not_exists` or `_create_remote_service_account_if_not_exists`? Since the other configurable options refer to "external", I guess I'd prefer the former.
        )

        self.log.info(
            f"Created service account {service_account_name} in namespace {namespace}"
Is there any way to access the "name" of the external cluster? Seems really helpful to include that here.
f"Created service account {service_account_name} in namespace {namespace}" | |
f"Created service account {service_account_name} in namespace {namespace} of external cluster {external_cluster}" |
f"Created service account {service_account_name} in namespace {namespace}" | ||
) | ||
|
||
def _create_role_if_not_exists(self, namespace: str) -> None: |
A similar name change to that suggested previously would be nice here.
    def _create_role_if_not_exists(self, namespace: str) -> None:
        """If role doesn't exist in target cluster, create one. Occurs if a remote cluster is being used"""
        role_yaml_path = os.getenv('EG_REMOTE_CLUSTER_ROLE_PATH')
Should `role_yaml_path` be validated for `None` and the file's existence? This line (and any validation) should be moved within the `if kernel_cluster_role not in remote_cluster_role_names:` block at L379 so that it's only performed if needed.
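A minimal sketch of the validation being suggested, assuming the names from the diff (`role_yaml_path`, `EG_REMOTE_CLUSTER_ROLE_PATH`); in the actual method this would live inside the `if kernel_cluster_role not in remote_cluster_role_names:` block and report failures via `self.log_and_raise` rather than raising directly:

```python
import os


def resolve_role_yaml_path() -> str:
    """Validate EG_REMOTE_CLUSTER_ROLE_PATH before attempting to create the role (sketch)."""
    role_yaml_path = os.getenv("EG_REMOTE_CLUSTER_ROLE_PATH")
    if not role_yaml_path:
        raise RuntimeError("EG_REMOTE_CLUSTER_ROLE_PATH is not set")
    if not os.path.isfile(role_yaml_path):
        raise RuntimeError(f"Role definition file not found: {role_yaml_path}")
    return role_yaml_path
```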
        else:
            if is_env_true('EG_USE_REMOTE_CLUSTER'):
                self.log.warning(
                    "Cannot use EG_USE_REMOTE_CLUSTER and EG_SHARED_NAMESPACE at the same time. Using local cluster...."
👍
Problem Statement
This pull request addresses issue #1235 and enables multi-cluster operation for Jupyter Enterprise Gateway. To reiterate the problem: currently, Enterprise Gateway launches kernels only on the cluster where it is running. This poses limitations in cases where we want the kernel to have access to resources on a remote cluster without running Enterprise Gateway on that cluster specifically. Such cases often arise in the interest of security and isolation of internal services from production services. While the k8s client supports connecting to and launching/managing resources on a remote cluster, this capability just isn't implemented in Enterprise Gateway's current client usage.
Feature Description
The changes in this PR let users provide a kubeconfig file that Jupyter Enterprise Gateway uses to launch kernels on the remote cluster the kubeconfig points to. Specifically, Jupyter Enterprise Gateway uses that kubeconfig to create and manage the kernel's resources (namespace, service account, role, and kernel pod) in the external cluster.
Operator Instructions:

- Place the kubeconfig file for the external cluster in the `config/` subdirectory of the `etc/kubernetes/helm/enterprise-gateway` chart.
- Set `externalCluster.enabled` to `true` (see the example values below).
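For illustration, enabling the feature might look like the following values override. Only `externalCluster.enabled` is confirmed by this PR's description; the release name and any other keys are assumptions to be checked against the chart's values.yaml.

```yaml
# Hypothetical values override; verify key names against the chart's values.yaml.
# Example install/upgrade:
#   helm upgrade --install enterprise-gateway etc/kubernetes/helm/enterprise-gateway \
#     --set externalCluster.enabled=true
externalCluster:
  enabled: true
```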
Implementation Details
When the operator enables external cluster operation and installs/upgrades the helm chart, the following steps occur:

- The environment variables `EG_USE_REMOTE_CLUSTER`, `EG_REMOTE_CLUSTER_KUBECONFIG_PATH`, and `EG_DEFAULT_KERNEL_SERVICE_ACCOUNT_NAME` are set on the Enterprise Gateway deployment.
When the enterprise gateway pod starts:
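The PR introduces a `KubernetesClientFactory` for this step; the sketch below shows one way such startup selection could work with the official `kubernetes` Python client and the environment variables named above. It is an illustration under those assumptions, not the PR's exact code.

```python
import os

from kubernetes import client, config


def build_core_api() -> client.CoreV1Api:
    """Return a CoreV1Api bound to either the external or the local cluster (sketch)."""
    if os.getenv("EG_USE_REMOTE_CLUSTER", "false").lower() == "true":
        # Load the operator-provided kubeconfig that points at the external cluster.
        config.load_kube_config(config_file=os.getenv("EG_REMOTE_CLUSTER_KUBECONFIG_PATH"))
    else:
        # Default behavior: EG manages kernel pods in the cluster in which it resides.
        config.load_incluster_config()
    return client.CoreV1Api()
```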
When a kernel is launched:
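Per the discussion above, remote-cluster operation expects the kernel namespace to exist already. A small sketch (hypothetical helper, using the official `kubernetes` client) of verifying that before the kernel pod is created:

```python
from kubernetes import client
from kubernetes.client.rest import ApiException


def ensure_namespace_exists(api: client.CoreV1Api, namespace: str) -> None:
    """Fail fast if the kernel namespace is missing from the external cluster (sketch)."""
    try:
        api.read_namespace(name=namespace)
    except ApiException as err:
        if err.status == 404:
            raise RuntimeError(
                f"Namespace '{namespace}' does not exist in the external cluster; "
                "create it before launching kernels there."
            ) from err
        raise
```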
Assuming network interconnection is set up correctly, the kernel pod should now launch and be able to communicate.
Some notes on the implementation