Redesign Airlock Export Process to Eliminate SAS URLs #4308

Open
TonyWildish-BH opened this issue Feb 4, 2025 · 25 comments

@TonyWildish-BH
Contributor

The airlock process is effective, but cumbersome. In particular, the need to use SAS URLs inside the workspace VMs means we can't completely block pasting into the VMs, which is something we'd like to do by default - only allowing it on a per-case basis, but that's another ticket.

The need for a SAS URL inside the workspace could be eliminated if the process were redesigned. Once an import is approved, there's no reason for access to it to be only ephemeral; it makes sense to have the file accessible for the lifetime of the project. The file can be pushed to the/a shared storage directly, so it's immediately accessible from all machines, eliminating the need for Storage Explorer on import.

For exports, a staging storage space can be made available, and the user given access to it from within the workspace. The act of pushing a file there can be used to trigger creation of a draft export request, with the file being automatically moved to storage that is inaccessible to the user, preserving the NONCE semantics.
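
To make the "file drop triggers a draft request" idea concrete, here is a minimal sketch (not the actual TRE implementation) of a blob-triggered Azure Function that creates a draft export request via an API call. The container name, API route, payload fields and environment variables are all assumptions for illustration.

```python
# Hypothetical sketch: watch an export "staging" container and create a draft
# airlock export request when a file lands. The TRE API route and payload below
# are illustrative assumptions, not the real AzureTRE API contract.
import os
import logging

import azure.functions as func
import requests

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob",
                  path="export-staging/{name}",        # assumed staging container
                  connection="StagingStorageConnection")
def on_export_file_dropped(blob: func.InputStream):
    workspace_id = os.environ["WORKSPACE_ID"]          # assumed configuration
    tre_api = os.environ["TRE_API_URL"]

    payload = {
        "type": "export",
        "title": f"Draft export for {blob.name}",
        "businessJustification": "Auto-created from staging drop",  # placeholder
    }

    # Create the draft request; authentication to the TRE API (e.g. a managed
    # identity token) is omitted here for brevity.
    resp = requests.post(
        f"{tre_api}/api/workspaces/{workspace_id}/requests",  # assumed route
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    logging.info("Created draft export request for %s", blob.name)
```

The airlock processor would then move the blob into the request's own, user-inaccessible storage, preserving the current NONCE-style semantics.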

This is related to #2402, about the need for access to the UI from within the workspace. However, this simplification of the airlock process is worth it on its own, regardless of that issue. It will greatly improve the user experience.

@marrobi
Member

marrobi commented Feb 4, 2025

I'll ask others to comment, but I believe the reason for dedicated storage per request was that we had a scenario where the PI wanted to do the import and not share it with others in the workspace, hence it must not touch the shared storage.

We could have the option to automatically transfer the file - maybe with a destination specified on the request.

As for automating the creation, there is a certain amount of metadata needed alongside a request - how would that get provided?

@TonyWildish-BH
Contributor Author

What metadata is required that isn't available in the workspace already? Presumably a function triggered by the file upload can access that metadata - something similar must be happening now, unless I've misunderstood the process?

@jonnyry
Collaborator

jonnyry commented Feb 4, 2025

Throwing in my two pennies worth:

  • It would be a smoother experience if the researcher could upload files to / download files from the TRE portal itself to import and export from the Airlock. Realise there's a lot of work to make that happen.

  • Could we move to Entra auth on the airlock storage accounts and do away with the SAS key altogether? Is the time-bound SAS token necessary? (See the sketch below this list for what RBAC-based access could look like.)

  • Could we rationalise/reduce the number of storage accounts used by the airlock? I understand different network settings are needed on some accounts - for example, the export needs to be publicly available. Though could some accounts be shared (e.g. blocked/rejected/in progress), using different containers to compartmentalise the data?

The number of accounts (particularly the workspace ones) seems to be a bit of a bind on scalability, and also ups the cost when Defender is part of the mix (although you can create exclusions).
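
To illustrate the second bullet, here is a minimal sketch of what researcher-side access could look like with Entra ID (RBAC) instead of a SAS URL, assuming the signed-in identity has been granted a data-plane role such as Storage Blob Data Contributor on the container; the account and container names are made up.

```python
# Sketch: uploading to an airlock container with Entra ID auth instead of a SAS URL.
# Assumes the signed-in identity has a data-plane role on the container;
# account/container/file names are illustrative.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()  # picks up az login, managed identity, etc.
service = BlobServiceClient(
    account_url="https://stalexipws1234.blob.core.windows.net",
    credential=credential,
)

container = service.get_container_client("c1234-export-in-progress")
with open("results.csv", "rb") as data:
    container.upload_blob(name="results.csv", data=data, overwrite=True)
```

The same applies on the command line: azcopy, for example, supports Entra ID sign-in as an alternative to SAS tokens.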

Airlock storage accounts - core

| Name | Description |
| --- | --- |
| st + airlockp + <TRE_ID> | Airlock Processor |
| st + alexapp + <TRE_ID> | Airlock Export Approved |
| st + alimblocked + <TRE_ID> | Airlock Import Blocked |
| st + alimex + <TRE_ID> | Airlock Import External |
| st + alimip + <TRE_ID> | Airlock Import In Progress |
| st + alimrej + <TRE_ID> | Airlock Import Rejected |

Airlock storage accounts - per workspace

| Name | Description |
| --- | --- |
| st + alexblocked + ws + <WS_ID> | Airlock Export Blocked |
| st + alexint + ws + <WS_ID> | Airlock Export Internal |
| st + alexip + ws + <WS_ID> | Airlock Export In Progress |
| st + alexrej + ws + <WS_ID> | Airlock Export Rejected |
| st + alimapp + ws + <WS_ID> | Airlock Import Approved |

@fortunkam

It would be a smoother experience if the researcher could upload files to / download files from the TRE portal itself to import and export from the Airlock. Realise there's a lot of work to make that happen.

I have a branch I am working on which will enable this exact scenario (client side storage access using RBAC), will tag this issue once that PR is ready.

@marrobi
Member

marrobi commented Feb 4, 2025

@fortunkam can you create an issue with details so we can assign it to you? thanks.

@TonyWildish-BH
Contributor Author

It would be a smoother experience if the researcher could upload files to / download files from the TRE portal itself to import and export from the Airlock. Realise there's a lot of work to make that happen.

I have a branch I am working on which will enable this exact scenario (client side storage access using RBAC), will tag this issue once that PR is ready.

That sounds useful in some circumstances, but not all. We often have data sitting on VMs in Azure that needs to be imported, so pulling it locally and then uploading through the browser isn't a good solution in those cases.

Are there any limits on file size for upload via the UI? Now that the malware checking limit of 2 GB has been raised (I've tested up to 50 GB, just for fun), it would be good to know if the browser or UI impose their own limits.

@fortunkam

I tested the client-side file upload with files up to 2 GB. Will give it a go with a larger file once I've pulled the changes in.

@fortunkam

See #4309

@TonyWildish-BH
Contributor Author

Thanks. Please don't remove the ability to upload via the CLI - we need that for cases where we don't have a browser, such as when the data is on a Linux VM in the cloud. It's not always feasible or desirable to pull the data locally and then push it through the browser.

@marrobi
Member

marrobi commented Feb 4, 2025

Thanks @fortunkam. Can any "upload via UI" discussion happen in #4309.

What are the other points to discuss here?

  • Restructuring airlock storage accounts to use fewer accounts
  • Auto creation of airlock requests from a file drop
  • SAS token vs RBAC for file upload?

Is that it? I think these are likely separate issues with separate implementation plans.

@marrobi
Member

marrobi commented Feb 7, 2025

@TonyWildish-BH #4335 might be of interest?

@TonyWildish-BH
Contributor Author

@TonyWildish-BH #4335 might be of interest?

Thanks @marrobi. That's an option for imports, yes, but I'd also like to make exports easier.

@marrobi
Member

marrobi commented Feb 10, 2025

@TonyWildish-BH @jonnyry I'm going to rename this to "Redesign Airlock Export Process to Eliminate SAS URLs".

If you feel this misrepresents the issue, let me know. If there are other "airlock issues", please create new issues.

marrobi changed the title from "Proposed change to airlock process" to "Redesign Airlock Export Process to Eliminate SAS URLs" on Feb 10, 2025
@jonnyry
Collaborator

jonnyry commented Feb 11, 2025

Opened a separate ticket for my point above on the number of storage accounts: #4358

@marrobi
Member

marrobi commented Feb 11, 2025

@TonyWildish-BH can I close this issue in favour of these new issues? If another one is required, please suggest it and we can close this conversation. Thanks.

@TonyWildish-BH
Contributor Author

TonyWildish-BH commented Feb 12, 2025

@marrobi, #4335 and #4358 are fine, but #4309 is not a complete solution for us.

Using the UI only works if the file to upload, or the download destination, is locally available. We frequently push data through VMs in the cloud, and use the CLI to upload from or download to those VMs. Using the UI isn't always an option there.

I'll comment on #4309, but until that covers all the bases, I don't think this ticket should be closed.

@marrobi
Member

marrobi commented Feb 12, 2025

@TonyWildish-BH can you be specific about what this ask is for? I'm a little confused.

Is it "Trigger creation of a draft export request when file is dropped in specific location in workspace shared storage"?

If so can we rename the issue?

@TonyWildish-BH
Contributor Author

@marrobi, the current title is on the mark: I'd like to eliminate the use of SAS URLs. In particular, I'd like to eliminate their use inside the TRE.

There are two reasons for this:

  1. The primary reason is that we want to be able to disable pasting into the workspace, and eliminating SAS URLs is a requirement for that.
  2. The secondary reason is that the airlock process is very cumbersome, and eliminating SAS URLs can be done in a way that makes the user experience much more pleasant.

The "Trigger creation..." bit is just one suggestion, I'm open to other ideas. I don't much mind how it's done as long as it satisfies those two requirements.

Hope that clarifies things.

@West-P

West-P commented Feb 12, 2025

When creating the storage account, can we leverage the requesting user's identity to add a role assignment on the blob container?

The Airlock Manager role could also be added. Potentially this could be done via a group (if group creation is enabled when creating a workspace), as a workspace could have multiple airlock managers.

This would allow authentication to the container via Entra ID rather than a SAS token.

The UI could then give the container URL rather than the full SAS token.
However, the airlock help files would also need updating, as a different process is needed if connecting via Storage Explorer/CLI and/or the Azure Portal.

It would remove the need for a SAS token; however, I don't think it makes the process any more efficient for the end user, other than removing the knowledge barrier of what to do with a SAS token.
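
As a rough illustration of the role-assignment idea (not how the TRE currently does it), here is a sketch using the Python management SDK to grant a user or group data-plane read access scoped to a single container; all IDs and resource names are placeholders.

```python
# Sketch: granting the requesting user (or an Airlock Managers group) data-plane
# access to a single blob container via Entra ID rather than a SAS token.
# Subscription, resource group, account and principal IDs are placeholders.
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

SUB_ID = "00000000-0000-0000-0000-000000000000"
STORAGE_BLOB_DATA_READER = "2a2b9908-6ea1-4ae2-8e65-a410df84e7d1"  # built-in role ID

client = AuthorizationManagementClient(DefaultAzureCredential(), SUB_ID)

# Scope the assignment to one container, not the whole account.
scope = (
    f"/subscriptions/{SUB_ID}/resourceGroups/rg-ws-1234"
    "/providers/Microsoft.Storage/storageAccounts/stalexipws1234"
    "/blobServices/default/containers/c1234-request-abc"
)

client.role_assignments.create(
    scope=scope,
    role_assignment_name=str(uuid.uuid4()),
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=(
            f"/subscriptions/{SUB_ID}/providers/Microsoft.Authorization"
            f"/roleDefinitions/{STORAGE_BLOB_DATA_READER}"
        ),
        principal_id="11111111-1111-1111-1111-111111111111",  # user or group object ID
        principal_type="User",  # or "Group" for an Airlock Managers group
    ),
)
```

Note that container-level scope is the finest granularity plain blob RBAC offers, which is relevant to the per-file discussion below.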


Addition to the above:
What authentication is used for the airlock security scan(s)? Does this currently use the SAS token to scan when an airlock request is submitted? If it does, a different method would need to be used, such as a service principal with a role assignment on the storage account.

@marrobi
Member

marrobi commented Feb 12, 2025

Can we grant RBAC access on a per-file basis? I think that is why SAS tokens were used.

@West-P

West-P commented Feb 13, 2025

Can we grant RBAC access on a per-file basis? I think that is why SAS tokens were used.

No, you cannot. You can use ACLs (per container) if using Data Lake.
However, could the airlock process change to creating a storage account per request?

Again, not ideal, as if you had 100 requests you would then have 100 storage accounts floating around.
The storage account would need to be destroyed if a request is denied, and as for what happens to a storage account after the request is approved... well, I guess it just persists, unless you set up a schedule whereby an approved request gives access into the TRE workspace for x days and after x days the storage account is removed.

But it all comes back to SAS tokens essentially.
I'm trying to figure out ways around it, but it makes sense why SAS tokens were used in the first place.
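
On the ACL point, for reference: with hierarchical namespace (Data Lake / ADLS Gen2) enabled, ACLs can actually be set at the directory and file level as well as per container, so per-file access could in principle be expressed that way rather than via SAS tokens. A minimal sketch, with placeholder names and object IDs:

```python
# Sketch: per-file access via ADLS Gen2 ACLs (requires hierarchical namespace,
# i.e. the airlock accounts would need to be ADLS Gen2 rather than plain blob).
# Account, filesystem, path and the user's object ID are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://stalexipws1234.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

fs = service.get_file_system_client("export-requests")
file_client = fs.get_file_client("request-abc/results.csv")

# The acl string replaces the full ACL on this one file: owning user, owning
# group and other entries, plus a named-user entry granting read to one user.
file_client.set_access_control(
    acl="user::rw-,group::r--,other::---,"
        "user:11111111-1111-1111-1111-111111111111:r--"
)
```

This wouldn't remove the storage-account quota concerns discussed here, but it is one way to get finer-than-container granularity without SAS.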

@marrobi
Member

marrobi commented Feb 13, 2025

@West-P there are quotas on the number of storage accounts per subscription; we already run into these when we have a large number of workspaces. So a storage account per request isn't a good idea.

I think the only solution is automatic transfer to/from a shared area, but this depends on whether it is appropriate for the data in the request to be accessible by all users in the workspace.

@West-P

West-P commented Feb 13, 2025

@West-P there are quotas on the number of storage accounts per subscription; we already run into these when we have a large number of workspaces. So a storage account per request isn't a good idea.

I think the only solution is automatic transfer to/from a shared area, but this depends on whether it is appropriate for the data in the request to be accessible by all users in the workspace.

Yes, I can see that being an issue.
The only other ideas I haven't added yet, as I can't think of an end-to-end solution for them, are using Azure Files or using a small data disk to mount to VMs for the process.

There are a lot of issues with both though.
Azure Files would mean Storage Explorer could be used; however, the permissions become a bit tricky with having to create a new share per request.
A data disk may be useful; however, doing security scans on these may be an issue. Plus you would then need to enable the disk once it is mounted onto a VM (this could be automated), but then this takes away the ability to azcopy into it.

@jonnyry
Collaborator

jonnyry commented Feb 13, 2025

Could this be used? Looks like a new feature...

https://learn.microsoft.com/en-us/azure/role-based-access-control/conditions-overview

@marrobi
Member

marrobi commented Feb 13, 2025

Potentially. I'm not sure it resolves the SAS situation, but it might help with the multiple storage accounts (#4358), as you can add conditions on each blob/container based on the private endpoint it's being accessed from. I did test it a few months ago at a high level and it looked promising.
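
For reference, a sketch of what an Azure RBAC (ABAC) condition looks like; this example restricts blob reads to a single container and would be supplied as the condition (with condition_version "2.0") on a role assignment like the one sketched earlier. Whether the environment attributes needed to condition on the private endpoint cover the airlock cases would need checking against the feature's current state; the container name is a placeholder.

```python
# Sketch: an ABAC condition string limiting blob reads to one container.
# Attach it to a role assignment via its `condition` / `condition_version` fields.
condition = (
    "((!(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices"
    "/containers/blobs/read'}))"
    " OR "
    "(@Resource[Microsoft.Storage/storageAccounts/blobServices/containers:name]"
    " StringEquals 'c1234-request-abc'))"
)
```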
