External Cluster Environments #1244
base: main
Conversation
Thanks for submitting your first pull request! You are awesome! 🤗
for more information, see https://pre-commit.ci
…enterprise_gateway into feature/remote-cluster
for more information, see https://pre-commit.ci
Seems it was mentioned in another PR that the python interrupt test failures are a red herring, is that correct?
Hi @Shrinjay - thank you for providing this pull request. I hope to be able to start reviewing this sometime this afternoon or tomorrow (PST).
That is correct.
Hello @kevin-bates! Happy to hear, looking forward to the review :)
Are these merely to access resources on a different cluster in general? Or do we have the intention to enable mapping userA to clusterA and userB to clusterB? And how is that done?
Good question @lresende. Reading the (excellent!) description, I believe this is more of a one-time thing - EG is either managing kernel pods within the cluster in which it resides, or EG is managing kernel pods within an external cluster to which it has access but in which it does not reside. If that statement is correct, it might be better to update the title (and feature name) to something like External Cluster Support, or similar, since the current title can imply multiple, and simultaneous, cluster support. @Shrinjay - thoughts?
@kevin-bates I absolutely agree, I didn't think about it like that, but calling it External Cluster and switching the helm charts to use the prefix `externalCluster` works.
Updated!
for more information, see https://pre-commit.ci
…enterprise_gateway into feature/remote-cluster
Hi @Shrinjay. First I want to thank you for the pull request and the most excellent write-up you provided to describe this feature. That is much appreciated!
I've provided a few comments (most of which are nits about camelCasing 🐫 😄), but there are some questions regarding edge cases related to `EG_SHARED_NAMESPACE` (in which the kernels are meant to run in the same namespace as EG) and BYO namespaces (in which `KERNEL_NAMESPACE` is provided in the request from the client), among others - mostly minor.
Also, I don't have a way to test this. Have you ensured that spark-based kernels are properly behaved (primarily wrt creation of the executor pods, etc.)?
Thank you!
@@ -0,0 +1,5 @@
"""Instantiates a static global factory and a single atomic client"""
This comment is more about the introduction of the `services/external` directory. One thing we need to keep in mind is that, for EG 4.0, most of these changes will move to `gateway_provisioners`. Since it ONLY has the equivalent of `services/processproxies`, and since this doesn't seem like an "external service", I'm inclined to suggest we simply include these files alongside `services/processproxies/k8s.py`. I don't think they warrant their own location within the package since they are solely used by `KubernetesProcessProxy` and its subclasses (and the launcher). Would there be an issue with moving these into `services/processproxies`?
Nope, I think that's fine, I can move them into processproxies and go from there
from ..utils.envutils import is_env_true


class KubernetesClientFactory(SingletonConfigurable):
I love this! Thank you for making this better!
etc/kubernetes/helm/enterprise-gateway/templates/deployment.yaml (Outdated; resolved)
etc/kubernetes/helm/enterprise-gateway/templates/kernel-role.yaml (Outdated; resolved)
etc/kubernetes/helm/enterprise-gateway/templates/role-configmap.yaml (Outdated; resolved)
@@ -227,7 +229,6 @@ def _determine_kernel_pod_name(self, **kwargs: dict[str, Any] | None) -> str:
        return pod_name

    def _determine_kernel_namespace(self, **kwargs: dict[str, Any] | None) -> str:
(The following applies to lines L240-L254 below...)

- Today, if the user is bringing their own namespace via `KERNEL_NAMESPACE`, it is their responsibility to ensure proper operation within that namespace. With this PR, and if `EG_USE_REMOTE_CLUSTER` is True, I'm assuming the same requirement holds. Are there any pieces of new functionality that would preclude users from bringing their own namespace (where that namespace resides in the remote cluster)?
- Clearly, `EG_SHARED_NAMESPACE` = True and `EG_USE_REMOTE_CLUSTER` = True conflict. Perhaps we should check for `EG_SHARED_NAMESPACE` = True when obtaining the k8s client and either log and return the local client or raise an exception?

((Just to note, in `gateway_provisioners` the default value for `EG_SHARED_NAMESPACE` is True to allow for easier deployments outside of something like EG, since the provisioners are available to any `jupyter_client` application.))
- Something I realize should be documented better is that remote-cluster operation requires the namespace to already exist before we launch a kernel there. This was mostly a security concern: as long as operations are isolated within a namespace in the remote cluster it's fine, but creating a namespace is a cluster-level operation, which can be scary for operators to enable.
- Agreed on this; we should log a warning and then ignore it if `EG_USE_REMOTE_CLUSTER` is enabled.
Hi @Shrinjay - gentle ping - could you please address the review comments?
@kevin-bates My deepest apologies, I was working on this as part of an initiative at work, but I got pulled off to work on something else and this completely slipped my mind. I'm back on this now, and my contract is coming to an end, so I'll be addressing these comments and looking to get this PR done within the upcoming weeks.
Co-authored-by: Kevin Bates <[email protected]>
Co-authored-by: Kevin Bates <[email protected]>
Co-authored-by: Kevin Bates <[email protected]>
Co-authored-by: Kevin Bates <[email protected]>
Co-authored-by: Kevin Bates <[email protected]>
Co-authored-by: Kevin Bates <[email protected]>
for more information, see https://pre-commit.ci
@kevin-bates Made some updates based on your suggestions; however, your comment on testing got me thinking, and I can't figure out a great way to test this repeatably. Currently we just test this manually; the only success criterion is that a kernel is able to launch, since everything after that is the same as if it were running on a local cluster. I can't really tell whether the processproxies and such are tested in the itests as well. If they are, then we could replicate the processproxies test harness for this. Any guidance?
Hi @Shrinjay - thanks for the update. Regarding testing, we have no tests for Kubernetes process proxies. The integration test we have uses the …

My biggest areas of concern are the behaviors around BYO Namespace (where the client pre-creates the namespace and references that name in `KERNEL_NAMESPACE`).

I will try to spend some time reviewing this PR in the next few days.
Hi @Shrinjay - I didn't have a chance to verify existing behavior, but plan to do so before its merge. I had some relatively minor comments. Thanks again for contributing this!
from ..kernels.remotemanager import RemoteKernelManager
from ..sessions.kernelsessionmanager import KernelSessionManager
from ..utils.envutils import is_env_true
Could we refactor this filename to include an underscore for word separation?
- from ..utils.envutils import is_env_true
+ from ..utils.env_utils import is_env_true
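For context, `is_env_true` itself isn't shown in this excerpt. A minimal sketch of what such a truthy-environment-variable helper might look like, purely as an assumption (the actual implementation in enterprise_gateway may differ):

```python
import os


def is_env_true(name: str) -> bool:
    """Return True when the named environment variable holds a truthy string.

    Sketch only - the real helper may accept different truthy spellings or a default.
    """
    return os.getenv(name, "false").strip().lower() in ("true", "1", "yes")
```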
@@ -239,7 +248,7 @@ def _determine_kernel_namespace(self, **kwargs: dict[str, Any] | None) -> str:

        # If KERNEL_NAMESPACE was provided, then we assume it already exists. If not provided, then we'll
        # create the namespace and record that we'll want to delete it as well.
-       namespace = kwargs["env"].get("KERNEL_NAMESPACE")
+       namespace = os.environ.get("KERNEL_NAMESPACE")
Prior to launch, `KERNEL_NAMESPACE` must be pulled from the start request arguments (`kwargs["env"]`) rather than the EG process env, so this change needs to be reverted.
- namespace = os.environ.get("KERNEL_NAMESPACE")
+ namespace = kwargs["env"].get("KERNEL_NAMESPACE")
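To illustrate why the distinction matters (the values below are hypothetical): `kwargs["env"]` carries the per-kernel environment sent with each start request, while `os.environ` is the gateway's own process environment shared by every kernel it launches.

```python
import os

# Hypothetical start-request kwargs, roughly as a process proxy would receive them.
kwargs = {"env": {"KERNEL_NAMESPACE": "team-a-kernels", "KERNEL_USERNAME": "alice"}}

# Per-kernel value taken from the start request (the behavior being restored above):
namespace_from_request = kwargs["env"].get("KERNEL_NAMESPACE")   # "team-a-kernels"

# Gateway process environment - one value shared across all kernels, usually unset:
namespace_from_process = os.environ.get("KERNEL_NAMESPACE")      # typically None
```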
self.log.warning(f"Deleted kernel namespace: {namespace}") | ||
else: | ||
reason = f"Error occurred creating namespace '{namespace}': {err}" | ||
self.log_and_raise(http_status_code=500, reason=reason) | ||
|
||
return namespace | ||
|
||
def _create_service_account_if_not_exists( |
If this is strictly for external clusters, could we rename this to something like `_create_external_service_account_if_not_exists` or `_create_remote_service_account_if_not_exists`? Since the other configurable options refer to "external", I guess I'd prefer the former.
        )

        self.log.info(
            f"Created service account {service_account_name} in namespace {namespace}"
Is there any way to access the "name" of the external cluster? Seems really helpful to include that here.
f"Created service account {service_account_name} in namespace {namespace}" | |
f"Created service account {service_account_name} in namespace {namespace} of external cluster {external_cluster}" |
f"Created service account {service_account_name} in namespace {namespace}" | ||
) | ||
|
||
def _create_role_if_not_exists(self, namespace: str) -> None: |
A similar name change to that suggested previously would be nice here.
    def _create_role_if_not_exists(self, namespace: str) -> None:
        """If role doesn't exist in target cluster, create one. Occurs if a remote cluster is being used"""
        role_yaml_path = os.getenv('EG_REMOTE_CLUSTER_ROLE_PATH')
Should `role_yaml_path` be validated for `None` and the file's existence? This line (and any validation) should be moved within the `if kernel_cluster_role not in remote_cluster_role_names:` block at L379 so that it's only performed if needed.
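A minimal sketch of the validation being suggested, assuming the names from the diff (`role_yaml_path`, `EG_REMOTE_CLUSTER_ROLE_PATH`); in the actual method this would live inside the `if kernel_cluster_role not in remote_cluster_role_names:` block and report failures via `self.log_and_raise` rather than raising directly:

```python
import os


def resolve_role_yaml_path() -> str:
    """Validate EG_REMOTE_CLUSTER_ROLE_PATH before attempting to create the role (sketch)."""
    role_yaml_path = os.getenv("EG_REMOTE_CLUSTER_ROLE_PATH")
    if not role_yaml_path:
        raise RuntimeError("EG_REMOTE_CLUSTER_ROLE_PATH is not set")
    if not os.path.isfile(role_yaml_path):
        raise RuntimeError(f"Role definition file not found: {role_yaml_path}")
    return role_yaml_path
```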
        else:
            if is_env_true('EG_USE_REMOTE_CLUSTER'):
                self.log.warning(
                    "Cannot use EG_USE_REMOTE_CLUSTER and EG_SHARED_NAMESPACE at the same time. Using local cluster...."
👍
Problem Statement
This pull request addresses issue #1235 and enables multi-cluster operation for Jupyter Enterprise Gateway. To reiterate the problem: currently, Enterprise Gateway launches kernels only on the cluster where it is running. This poses limitations in cases where we want the kernel to have access to resources on a remote cluster without running Enterprise Gateway on that cluster specifically. Such cases often arise in the interest of security and isolation of internal services from production services. While the k8s client supports connecting to and launching/managing resources on a remote cluster, this capability just isn't implemented in Enterprise Gateway's current client usage.
Feature Description
The changes in this PR let users provide a kubeconfig file that Jupyter Enterprise Gateway uses to launch kernels on the remote cluster the kubeconfig points to. Specifically, Jupyter Enterprise Gateway uses that kubeconfig to create and manage the kernel's resources (namespace, service account, role, and kernel pod) in the external cluster.
Operator Instructions:

- Place the kubeconfig file for the external cluster in the `config/` subdirectory of the `etc/kubernetes/helm/enterprise-gateway` chart.
- Set `externalCluster.enabled` to `true` (see the example values below).
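For illustration, enabling the feature might look like the following values override. Only `externalCluster.enabled` is confirmed by this PR's description; the release name and any other keys are assumptions to be checked against the chart's values.yaml.

```yaml
# Hypothetical values override; verify key names against the chart's values.yaml.
# Example install/upgrade:
#   helm upgrade --install enterprise-gateway etc/kubernetes/helm/enterprise-gateway \
#     --set externalCluster.enabled=true
externalCluster:
  enabled: true
```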
Implementation Details
When the operator enables external cluster operation and installs/upgrades the helm chart, the following steps occur:

- The environment variables `EG_USE_REMOTE_CLUSTER`, `EG_REMOTE_CLUSTER_KUBECONFIG_PATH`, and `EG_DEFAULT_KERNEL_SERVICE_ACCOUNT_NAME` are set on the Enterprise Gateway deployment.
When the enterprise gateway pod starts:
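The PR introduces a `KubernetesClientFactory` for this step; the sketch below shows one way such startup selection could work with the official `kubernetes` Python client and the environment variables named above. It is an illustration under those assumptions, not the PR's exact code.

```python
import os

from kubernetes import client, config


def build_core_api() -> client.CoreV1Api:
    """Return a CoreV1Api bound to either the external or the local cluster (sketch)."""
    if os.getenv("EG_USE_REMOTE_CLUSTER", "false").lower() == "true":
        # Load the operator-provided kubeconfig that points at the external cluster.
        config.load_kube_config(config_file=os.getenv("EG_REMOTE_CLUSTER_KUBECONFIG_PATH"))
    else:
        # Default behavior: EG manages kernel pods in the cluster in which it resides.
        config.load_incluster_config()
    return client.CoreV1Api()
```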
When a kernel is launched:
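Per the discussion above, remote-cluster operation expects the kernel namespace to exist already. A small sketch (hypothetical helper, using the official `kubernetes` client) of verifying that before the kernel pod is created:

```python
from kubernetes import client
from kubernetes.client.rest import ApiException


def ensure_namespace_exists(api: client.CoreV1Api, namespace: str) -> None:
    """Fail fast if the kernel namespace is missing from the external cluster (sketch)."""
    try:
        api.read_namespace(name=namespace)
    except ApiException as err:
        if err.status == 404:
            raise RuntimeError(
                f"Namespace '{namespace}' does not exist in the external cluster; "
                "create it before launching kernels there."
            ) from err
        raise
```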
Assuming network interconnection is set up correctly, the kernel pod should now launch and be able to communicate.
Some notes on the implementation