feat: adds Globus Connect Server-sourced assets as a datasource (via HTTPS) #675

Draft: wants to merge 3 commits into master

@jbottigliero jbottigliero commented Dec 2, 2024

This pull request adds integration with Globus. Specifically, it adds the ability to source data via HTTPS from a Globus Connect Server instance, authenticated using Globus Auth.

  • The credential_provider was created based off of the #src/util/google_oauth2.ts implementation and other OAuth-like credential providers (e.g. middleauth and ngauth).

The Globus credential provider uses PKCE for the authorization flow, for which I've added a few basic utilities (see footnote 1).
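For readers unfamiliar with PKCE (RFC 7636), the following is a minimal sketch of the two values such utilities typically produce: a random code verifier and its S256 code challenge. It is not the PR's implementation. The helper names `randomCodeVerifier` and `codeChallengeS256` are hypothetical, and the sketch uses Node's `node:crypto` module so it can run standalone, whereas browser code would normally use the asynchronous Web Crypto API instead.

```typescript
import { createHash, randomBytes } from "node:crypto";

// Convert standard base64 to the URL-safe, unpadded alphabet used by PKCE.
function toBase64Url(base64: string): string {
  return base64.replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

// Hypothetical helper: 32 random bytes encode to a 43-character verifier,
// the minimum length RFC 7636 allows.
function randomCodeVerifier(): string {
  return toBase64Url(randomBytes(32).toString("base64"));
}

// Hypothetical helper: S256 challenge = BASE64URL(SHA-256(ASCII(verifier))).
function codeChallengeS256(verifier: string): string {
  const digest = createHash("sha256").update(verifier, "ascii").digest("base64");
  return toBase64Url(digest);
}

const verifier = randomCodeVerifier();
const challenge = codeChallengeS256(verifier);
console.log({ verifier, challenge });
```

The verifier stays with the client until the token exchange, while only the challenge is sent with the authorization request.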

@ravescovi has successfully deployed an instance of Neuroglancer configured with Globus using the proposed changes – much of this implementation is based on his initial work integrating with the Neuroglancer codebase.

We're looking forward to discussing the implementation and seeing what needs to be addressed to get this into the mainline!

Footnotes

  1. These utilities could be replaced with an external library (e.g. pkce-challenge) if that is preferred. It might be worth noting that many of the added methods were pulled directly from the Globus SDK for JavaScript.

* own Client ID from Globus and substitute it in.
* @see https://docs.globus.org/api/auth/developer-guide/#developing-apps
*/
GLOBUS_CLIENT_ID: JSON.stringify("f3c5dd86-8c8e-4393-8f46-3bfa32bfcd73"),
Author

This default value is a Client ID that is managed by the Globus team.

It is something that could be distributed as part of the codebase or removed as a default and only referenced here as a comment.

Contributor

👍 from my side for at least not needing a fork to set a different value.

Author

@joshmoore – that was my intent here, but I am a bit unsure this method works as intended...

I thought something like npm run build -- --define GLOBUS_CLIENT_ID=example was the intended use, but it looks like the NEUROGLANCER_CLI environment flag needs to be disabled in order for the incoming --define properties to be merged.

Running npm run build -- --env NEUROGLANCER_CLI=false ... encounters an error due to the .strict() usage in build_tools/cli.ts – I don't want to derail the addition of this functionality, but just wanted to make sure I had a better grasp on the change and how it is used.

@jbms
Collaborator

jbms commented Jan 31, 2025

Thanks, and sorry for the delay in responding. Part of the delay is that I had been working on a major refactor of file access in Neuroglancer.

Now that it has landed, it should make it easier to integrate additional data sources like this, but some refactoring of your PR will be needed.

In the README you mention that the user needs to enter a UUID --- can you describe a bit more about how Globus works and how this UUID is handled?

If the UUID is necessary to access the server, shouldn't it be included in the datasource URL somehow so that when sharing a link to the Neuroglancer state, another user doesn't have to enter it?

@jbms
Collaborator

jbms commented Jan 31, 2025

Also, can you clarify how the client ids work and how the default client id you have provided will work?

As far as I can gather, users host their own globus server, but they rely on the central globus authentication server?

What are the allowed origins for the default client id? Is the list of allowed origins global or specific to a given globus server instance? Similarly, is the authentication token that is received valid for all globus servers or just a single instance (based on the way the scopes work, I guess the answer is just a single server)? Or is that irrelevant because users are local to a single server instance anyway?

The scope doesn't seem to say anything about what permissions are granted? Does that mean that all permissions (i.e. read and write) are granted?

A general issue that comes up is that a user may wish to view certain datasets in Neuroglancer, and therefore must necessarily grant a given Neuroglancer instance read access to that particular dataset. However, it is often not possible to limit permissions narrowly like that, and instead the user is forced to either grant no permission, or grant very broad permissions, e.g. read access to everything. This may not be an issue if Neuroglancer can be trusted with full access, e.g. because the person administering the Neuroglancer instance is also administering the Globus server, but often it is an issue.

This issue is exactly why Neuroglancer does not provide the option to access GCS via regular google oauth2 login, because it would require users to grant Neuroglancer full read access to all of their GCS resources, which would not normally be a good idea unless they have created a separate Google account with access limited to those datasets they wish to view in Neuroglancer. Instead, there is ngauth, which allows you to grant Neuroglancer access only to specific datasets.

Is it possible with Globus for a user to somehow say: I want to grant this specific Neuroglancer instance access to just these specific datasets? Furthermore, it would be nice if user A could grant that permission and then share a Neuroglancer link with user B, who also has access to that dataset. Because user A has already granted Neuroglancer access to the dataset, and user B has access to it as well, user B could log in and access it automatically without needing to grant any additional permissions.

@jbottigliero
Author

Hey @jbms! Thanks for taking the time to look at this. I'll review the refactor and work to get this code updated if you think we can land on an integration that makes sense – hoping my responses below can help determine that.

[...] can you describe a bit more about how Globus works and how this UUID is handled?

As a high-level background on the functionality, this would be tapping into the following aspects of the Globus Platform:

  • Globus Connect Server (GCS) is an agent installed on end-user systems. The agent acts as an interface to underlying storage systems, the broader Globus platform, GridFTP, and HTTPS.
  • A "Globus Collection" (Collection) is a specific access point to a GCS Endpoint, including path, access control, etc.
    GCS supports HTTPS access to resources in these Collections.
  • Resources are accessed with a combination of permissions configured at the Collection level (with some configuration inheritance) and local file system permissions.
    • I'll expand on this further below in reply to the scope behavior, but a token with a scope for the Collection is required (the scope contains the Collection's UUID as part of the identifier).
    • Globus Auth is then used as a federated authentication system and security fabric.

[...] If the UUID is necessary to access the server, shouldn't it be included in the datasource URL [...]

Currently, a GCS HTTPS URL does not contain enough information to determine the Collection's UUID, and no programmatic method exists for deriving this information — hopefully, this will be possible at some point in the future.

Also, can you clarify how the client ids work and how the default client id you have provided will work? As far as I can gather, users host their own globus server, but they rely on the central globus authentication server?

A registered Globus Application can be configured as an OAuth 2.0 confidential or public client, allowing Globus (Auth) to be used for authentication.

The Client ID included in the code change is a public client I've created (see footnote 1) with the following supported Redirect URIs:

  • https://localhost:8000, https://localhost:8080, https://127.0.0.1:8000, https://127.0.0.1:8080

A token created by the client would only be valid for the scopes requested by the client and approved by the user. Depending on how the data was shared, there are a few different scopes that the client might request:

  • https://auth.globus.org/scopes/<COLLECTION_ID>/https is always required (HTTPS access).
  • https://auth.globus.org/scopes/<COLLECTION_ID>/data_access is required for "mapped non-high assurance" collections (see footnote 2).

This means the token only grants the client access to the Collection where the data resides (not necessarily all GCS resources the user has access to)—that Collection might even be a single folder in a file system.
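As a rough illustration of the scope scheme described above, the sketch below builds the scope strings for a single Collection. Only the two scope URL patterns come from this thread. The function names, the `needsDataAccess` option, and the placeholder Collection ID are hypothetical, and the sketch assumes the caller already knows whether the collection is "mapped non-high assurance".

```typescript
// Scope URL prefix as listed in the comment above.
const SCOPE_BASE = "https://auth.globus.org/scopes";

interface CollectionScopeOptions {
  // Assumption: true when the collection is "mapped non-high assurance".
  needsDataAccess?: boolean;
}

// Hypothetical helper: list the scopes a client would request for one Collection.
function collectionScopes(
  collectionId: string,
  options: CollectionScopeOptions = {},
): string[] {
  const scopes = [`${SCOPE_BASE}/${collectionId}/https`];
  if (options.needsDataAccess) {
    scopes.push(`${SCOPE_BASE}/${collectionId}/data_access`);
  }
  return scopes;
}

// OAuth 2.0 scope parameters are space-delimited.
function toScopeParameter(scopes: string[]): string {
  return scopes.join(" ");
}

// "my-collection-id" is a placeholder, not a real Collection UUID.
console.log(toScopeParameter(collectionScopes("my-collection-id", { needsDataAccess: true })));
```

Because the Collection ID is embedded in every scope, a token obtained this way is tied to that one Collection rather than to everything the user can reach.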

I want to grant this specific Neuroglancer instance access to just these specific datasets? Furthermore, it would be nice if user A can grant that permission, and then share a Neuroglancer link with user B, who also has access to that dataset, and because user A has access to the dataset and already granted access to that dataset to Neuroglancer, and user B has access to the dataset, user B can login and then automatically access it without needing to specifically grant any additional permissions.

One potential solution would be for User A to create a Guest Collection (G1) at a specific path where all datasets to be shared with Neuroglancer would be published. That single Collection ID becomes the only scope requested by and granted to the installation (https://auth.globus.org/scopes/G1/https). In the current code, when User B interacts with Neuroglancer, GlobusLocalStorage.domainMappings would smooth out the UX a little by only asking for the Collection ID once (while it remains available in localStorage).
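To make the "ask only once" behaviour concrete, here is a minimal sketch of a hostname-to-Collection-ID cache kept in a localStorage-like store. The storage key, the `StoredMappings` shape, and the function names are all hypothetical. The PR's actual GlobusLocalStorage type may be structured differently.

```typescript
// Minimal subset of the Web Storage API, so the sketch also runs outside a browser.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical persisted shape: hostname -> Collection ID.
interface StoredMappings {
  domainMappings: Record<string, string>;
}

const STORAGE_KEY = "globus_sketch"; // hypothetical key

function readMappings(storage: StorageLike): StoredMappings {
  const raw = storage.getItem(STORAGE_KEY);
  if (raw === null) return { domainMappings: {} };
  try {
    const parsed = JSON.parse(raw) as Partial<StoredMappings>;
    return { domainMappings: parsed.domainMappings ?? {} };
  } catch {
    // Corrupt or unexpected data: fall back to an empty mapping.
    return { domainMappings: {} };
  }
}

// Returns the remembered Collection ID, or undefined if the user must be asked.
function getCollectionId(storage: StorageLike, hostname: string): string | undefined {
  return readMappings(storage).domainMappings[hostname];
}

function rememberCollectionId(storage: StorageLike, hostname: string, collectionId: string): void {
  const mappings = readMappings(storage);
  mappings.domainMappings[hostname] = collectionId;
  storage.setItem(STORAGE_KEY, JSON.stringify(mappings));
}
```

In a browser, `window.localStorage` satisfies `StorageLike`, so the prompt for a Collection ID is only needed when `getCollectionId` returns `undefined`.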

I know I didn't respond to all of your questions directly, but I hope the explanation above helps determine if this can move forward – I am happy to expand on any of the above!

Footnotes

  1. For transparency, I work for Globus, and my Globus account manages this application. Ideally, when someone is shipping an instance of Neuroglancer they would swap the pre-configured Client ID with an application they manage.

  2. Globus has two types of Collections, "Mapped" and "Guest". For the sake of this integration, I don't think those implementation details matter much.

@jbms
Collaborator

jbms commented Feb 7, 2025

Okay, let me see if I understand things correctly:

  • Globus connect server is a proxy-like server that provides authenticated access to various systems.
  • Note: the use of "GCS" to refer to Globus Connect Server was a bit confusing to me because I am used to "GCS" meaning "Google Cloud Storage".
  • To access data, you need (1) an HTTP URL for the Globus Connect Server, (2) a UUID, (3) authentication credentials, if anonymous access is not enabled.

Is there a 1-1 relationship between Globus collections and UUIDs? Is there a 1-1 relationship between Globus collections and Globus Connect Server hostnames/origins? Or can there be more than one Globus collection at a given hostname, i.e. an HTTP path is also needed? Or conversely, might the same Globus collection be hosted at more than one Globus Connect Server hostname/path?

You said that there is no way to go from Globus collection HTTP URL to the corresponding UUID, but given just the UUID, can we lookup the HTTP URL for it? This example seems to suggest that we can: https://docs.globus.org/globus-connect-server/v5.4/https-access-collections/#from_the_command_line

@jbottigliero
Author

Correct.

  • Yes, a Collection has a distinct UUID.
  • Globus Connect Server has many collections; HTTPS hostnames can be set at the Globus Connect Server level (endpoint) and/or collection level.

You said that there is no way to go from Globus collection HTTP URL to the corresponding UUID, but given just the UUID, can we lookup the HTTP URL for it? This example seems to suggest that we can: https://docs.globus.org/globus-connect-server/v5.4/https-access-collections/#from_the_command_line

Yes, given a Collection UUID and resource (file) path, you can programmatically determine the HTTPS hostname and construct a fully resolvable URL. The current implementation in the PR uses the HTTPS URL since this is more likely to be shared when referencing a specific resource (specifically when targeting HTTPS as the transport).

Note: the use of "GCS" to refer to Globus Connect Server was a bit confusing to me because I am used to "GCS" meaning "Google Cloud Storage".

Fair! Sorry about that!

@jbms
Collaborator

jbms commented Feb 7, 2025

What about using the following URL syntax then:

globus+https://<UUID>@<HOSTNAME>/<PATH>

Additionally, if I understand correctly, after you type @ it could auto-complete the hostname, or there could be an alternative syntax for specifying just the UUID and have it lookup the hostname automatically. But perhaps given normal usage of Globus that isn't useful?
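To show how the proposed `globus+https://<UUID>@<HOSTNAME>/<PATH>` syntax could be consumed, here is a small parsing sketch. It assumes the UUID travels in the URL's userinfo position, as in the proposal. The function and interface names are hypothetical, and the example UUID and hostname are placeholders.

```typescript
interface GlobusUrlParts {
  collectionId: string; // the UUID given before "@"
  host: string;         // hostname, plus port if one was specified
  path: string;         // resource path on the collection
}

// Hypothetical parser: strip the "globus+" prefix and reuse the standard URL parser.
function parseGlobusUrl(url: string): GlobusUrlParts {
  const prefix = "globus+";
  if (!url.startsWith(`${prefix}https://`)) {
    throw new Error(`Expected a globus+https:// URL, got: ${url}`);
  }
  const parsed = new URL(url.slice(prefix.length));
  if (parsed.username === "") {
    throw new Error("Missing collection UUID before '@'");
  }
  return {
    collectionId: decodeURIComponent(parsed.username),
    host: parsed.host,
    path: parsed.pathname,
  };
}

// Placeholder UUID and hostname, for illustration only.
console.log(
  parseGlobusUrl("globus+https://00000000-0000-0000-0000-000000000000@data.example.org/volumes/a.zarr"),
);
```

Delegating to the standard `URL` parser keeps port handling and percent-decoding out of hand-written code.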

Ideally we can find a syntax that works well not only for Neuroglancer but could also be supported by other tools, along the lines of my zarr proposal ZEP 8 --- see https://github.com/zarr-developers/zeps/pull/48/files

I see from the documentation here (https://docs.globus.org/globus-connect-server/v5/https-access-collections/#supported_http_methods) that directory listing is not supported. That will still work fine but for interactively entering/browsing datasets from Neuroglancer directory listing would certainly be helpful.

@joshmoore
Contributor

Two quick points for clarity (from the naive user side):

@jbottigliero force-pushed the feat-globus-datasource branch from 5ceda12 to 6a9e9a1 on February 11, 2025 01:20
@jbottigliero
Author

What about using the following URL syntax then:

globus+https://<UUID>@<HOSTNAME>/<PATH>

Additionally, if I understand correctly, after you type @ it could auto-complete the hostname, or there could be an alternative syntax for specifying just the UUID and have it lookup the hostname automatically. But perhaps given normal usage of Globus that isn't useful?

As @joshmoore points out, the full HTTPS URL is one of the more common references to be shared by end-users.

I just pushed up an alternative approach that would support globus+https://<https_asset_url>.

I'm using a request to the Globus Connect Server host that will result in a well-known error in order to derive the required scopes (and Collection ID, indirectly).
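A minimal sketch of the parsing side of this idea follows. It assumes the error response body is JSON carrying `authorization_parameters.required_scopes` (the field referenced in the PR's code) and that each scope follows the `https://auth.globus.org/scopes/<COLLECTION_ID>/...` pattern described earlier in the thread. The function names are hypothetical, and the exact status code and remaining fields of the error are not assumed here.

```typescript
// Matches the scope prefix described earlier and captures the Collection ID.
const SCOPE_PREFIX = /^https:\/\/auth\.globus\.org\/scopes\/([^/]+)\//;

// Hypothetical helper: defensively pull required_scopes out of an unknown JSON body.
function extractRequiredScopes(body: unknown): string[] {
  if (typeof body !== "object" || body === null) return [];
  const params = (body as { authorization_parameters?: unknown }).authorization_parameters;
  if (typeof params !== "object" || params === null) return [];
  const scopes = (params as { required_scopes?: unknown }).required_scopes;
  if (!Array.isArray(scopes)) return [];
  return scopes.filter((s): s is string => typeof s === "string");
}

// Hypothetical helper: derive the Collection ID from the first matching scope.
function collectionIdFromScopes(scopes: string[]): string | undefined {
  for (const scope of scopes) {
    const match = SCOPE_PREFIX.exec(scope);
    if (match) return match[1];
  }
  return undefined;
}
```

Note that anything derived this way is reported by the server being contacted, so it should be treated as untrusted input until it has been validated.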

I still need to do a closer review of the code changes to make sure this fits in well with the new provider patterns (and write some documentation), but just wanted to share this as a possible solution.

@jbottigliero changed the title from "feat: adds Globus GCS-sourced assets as a datasource" to "feat: adds Globus Connect Server-sourced assets as a datasource (via HTTPS)" on Feb 11, 2025
@jbms
Collaborator

jbms commented Feb 11, 2025

Detecting the UUID automatically sounds like a big improvement --- thanks!

Thinking a bit more about the client ids:

  1. Either a single client id gets baked into the build (i.e. specific deployment of neuroglancer), or
  2. there is some mechanism to configure the client id, either in the JSON state or via user local storage.

Option 1 is how such client ids are normally handled.

For any custom deployment of neuroglancer, whoever is deploying it can register their own globus application and option 1 works quite well.

However, with the current application id provided by @jbottigliero / Globus, the default deployment of neuroglancer at https://neuroglancer-demo.appspot.com will be unusable with Globus, which seems rather unfortunate.

My inclination is to instead deploy to neuroglancer-demo.appspot.com with a working client id, that allows both neuroglancer-demo.appspot.com as well as localhost. That way, if users wish to use the default deployment of neuroglancer with globus, they can do so.

Additionally, from a security perspective, I don't see any theoretical advantage of allowing the user to specify their own client id --- if the deployment is controlled by a malicious adversary, the user having entered their own client id doesn't provide any protection, because either way the adversary can obtain the credentials.

Potentially someone may wish to use globus via a localhost deployment for testing but not grant access to neuroglancer-demo.appspot.com, and also doesn't want to have to bother to register their own globus client id. In that case, a localhost-only client id provided directly by Globus, i.e. the one currently included in this PR, would still be useful. It could make sense to make that one the default one but override it when deploying to neuroglancer-demo.appspot.com.

{
method: "GET",
headers: {
"X-Requested-With": "XMLHttpRequest",
Collaborator

Why is this header being included? it will result in an additional preflight OPTIONS request.

Author

This header is used to signal "Programmatic Access" (https://docs.globus.org/globus-connect-server/v5.4/https-access-collections/#programmatic_access), which ensures a Content-Type: application/json response. I think this is managed as a separate header rather than the standard Accept header because the underlying asset is responsible for content negotiation when served over HTTPS.

I can add a comment here that explains this with a reference to the documentation.

const challenge = await generateCodeChallenge(verifier);
const url = getGlobusAuthorizeURL({
clientId,
scope: authorization_parameters.required_scopes,
Collaborator

I'm concerned that there may be a security vulnerability here --- suppose the user has previously accessed:

globus+https://good-host.com/... which has UUID XXXX

and granted access.

Then the user gets directed to visit a Neuroglancer link that specifies a datasource of globus+https://bad-host.com/..., and bad-host.com maliciously reports that it has the same UUID of XXXX as good-host.com.

What will happen in that case? Will the user have to grant permission again or will it be assumed?

Author

Good call, it does seem like a plausible vector. In the updated processing a server that spoofs the required_scopes of good-host.com would result in tokens for the good-host.com being sent as Authorization headers to bad-host.com.

As a way to validate the required_scopes response, it seems plausible to do a reverse lookup of the domain using the scopes in the response, but this might require additional initial consent. I'm going to discuss this internally at Globus to see if we can come up with a more secure alternative.
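One way to picture the validation step is sketched below. It assumes some trusted reverse lookup (not shown, and not confirmed to exist in this form) has already returned the HTTPS base URL registered for the Collection ID found in the scopes. The sketch only covers the final comparison that decides whether a token may be attached to a request. The function name and parameters are hypothetical.

```typescript
// Hypothetical check: only send a Collection's token to the origin that a
// trusted lookup says belongs to that Collection.
function isTokenSafeForRequest(requestUrl: string, registeredHttpsServer: string): boolean {
  try {
    const request = new URL(requestUrl);
    const registered = new URL(registeredHttpsServer);
    // Require HTTPS and an exact origin match (scheme, hostname, and port).
    return request.protocol === "https:" && request.origin === registered.origin;
  } catch {
    // Unparseable URLs are never considered safe.
    return false;
  }
}

// A host that merely reports the good host's scopes fails the comparison.
console.log(isTokenSafeForRequest("https://bad-host.com/data", "https://good-host.com"));
```

With a check like this, a spoofed scope response would trigger no Authorization header for the mismatched host, at the cost of the extra lookup and whatever consent that lookup requires.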

) as GlobusLocalStorage;
}

async function waitForAuth(
Collaborator

Please refactor this to use newly added utilities in src/credentials_provider/interactive_credentials_provider.ts

@jbms
Collaborator

jbms commented Feb 11, 2025

Please move it from src/datasource to src/kvstore. This is considered a "root key-value store" under the current terminology used in the neuroglancer docs.

@joshmoore
Contributor

Additionally, from a security perspective, I don't see any theoretical advantage of allowing the user to specify their own client id --- if the deployment is controlled by a malicious adversary, the user having entered their own client id doesn't provide any protection, because either way the adversary can obtain the credentials.

My concern would really be the impression of users who come to the system. Thinking of my installation, I would rather users see the authority figure they expect (i.e., me) when being asked to use oauth for an application.

@jbms
Collaborator

jbms commented Feb 11, 2025

Additionally, from a security perspective, I don't see any theoretical advantage of allowing the user to specify their own client id --- if the deployment is controlled by a malicious adversary, the user having entered their own client id doesn't provide any protection, because either way the adversary can obtain the credentials.

My concern would really be the impression of users who come to the system. Thinking of my installation, I would rather users see the authority figure they expect (i.e., me) when being asked to use oauth for an application.

To be clear, by user I mean whoever is using the browser. If you are deploying your own neuroglancer instance then you would anyway need to specify your own globus client id at build time.

If no one cares to use globus with the default neuroglancer instance then we could just not worry about that.
