Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workspace deploy fails because storage account is not unique #2893

Open
t-young31 opened this issue Nov 18, 2022 · 16 comments · Fixed by #3667
Open

Workspace deploy fails because storage account is not unique #2893

t-young31 opened this issue Nov 18, 2022 · 16 comments · Fixed by #3667
Assignees
Labels
bug Something isn't working has workaround a workaround is available for this issue

Comments

@t-young31
Copy link
Contributor

Describe the bug
A clear and concise description of what the bug is.

I've just deployed a workspace and get

mError: \u001b[0m\u001b[0m\u001b[1mcreating Azure Storage Account \"stgws2a73\": storage.AccountsClient#Create: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"StorageAccountAlreadyTaken\" Message=\"The storage account named stgws2a73 is already taken.

looks like the last 4 chars are maybe not unique enough. Seems odd though if it's ~1/10^6 chance

Steps to reproduce

  1. Deploy a workspace
  2. Be unlucky!
@t-young31 t-young31 added the bug Something isn't working label Nov 18, 2022
@marrobi
Copy link
Member

marrobi commented Nov 18, 2022

Very unlucky - depends how many TRE workspaces have been deployed. Maybe we can add the TRE ID into some of these identifiers.

@RIashfaqahmed
Copy link

Very unlucky - depends how many TRE workspaces have been deployed. Maybe we can add the TRE ID into some of these identifiers.

thats a good idea!

@JaimieWi
Copy link
Contributor

Just to mention, we have also come across this issue three times in the last week!

@martinpeck
Copy link
Member

FYI this is where we generate that name: https://github.com/microsoft/AzureTRE/blob/main/templates/workspaces/base/terraform/locals.tf#L4

We currently take the last 8 characters from a string that already includes the TRE ID. If we took more than 8 (e.g. like we do for the KV name) then we'd pick up the TRE ID as part of the storage account name. However, we have a limit of 24 characters....but that should still be fine.

@tamirkamara
Copy link
Collaborator

The root cause of this is using last 4 chars of a uuid (v4=random) for azure resources that need to be globally unique. The complicated factor is that templates make naming assumptions on resources they reference that were created by a parent template. This is also true for some of our python code mainly around the airlock (but is easier to fix).

We discussed a couple of options to resolve this:

  1. Increase the uniqueness level of the chars we use but keep the approach of the template developer making naming assumptions. e.g. use more chars of the uuid string (probably not going to help that much), use a real random string rather than the uuid4, etc.
  2. Having a way for the template to get the properties of its parents. The pipeline mechanism has something similar but for a specific use-case which we can consider making more generic.
    A template could generate its own random string as it see fit, which will be saved in cosmos (via rp outputs). A descendant template could then reference the property and the pipeline could replace that before sending the operation to the RP.
    This method will be challenging for situations where we have cross workspace references. Maybe in Airlock review? TBD

@marrobi
Copy link
Member

marrobi commented Jan 31, 2023

Hit this again today...

Status=<nil> Code=\"StorageAccountAlreadyTaken\" Message=\"The storage account named stgws07e4 is already taken.\"\u001b[0m \u001b[31m│\u001b[0m \u001b[0m \u001b[31m│\u001b[0m \u001b[0m\u001b[0m  with azurerm_storage_account.stg, \u001b[31m│\u001b[0m \u001b[0m  on storage.tf line 1, in resource \"azurerm_storage_account\" \"stg\": \u001b[31m│\u001b[0m \u001b[0m   1: resource \"azurerm_storage_account\" \"stg\" \u001b[4m{\u001b[0m\u001b[0m \u001b[31m│\u001b[0m \u001b[0m \u001b[31m╵\u001b[0m\u001b[0m \u001b[31m╷\u001b[0m\u001b[0m \u001b[31m│\u001b[0m \u001b[0m\u001b[1m\u001b[31mError: \u001b[0m\u001b[0m\u001b[1mcreating Azure Storage Account \"stalimappws07e4\": storage.AccountsClient#Create: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"StorageAccountAlreadyTaken\" Message=\"The storage account named stalimappws07e4 is already taken.

@marrobi
Copy link
Member

marrobi commented Apr 12, 2023

@tamirkamara where are we with this? Is #3243 a solution?

People keep hitting it, would be good to get a fix in - if no 2 above is to complex, then can we maybe get a more random ID as per 1?

@tamirkamara
Copy link
Collaborator

tamirkamara where are we with this? Is #3243 a solution?

People keep hitting it, would be good to get a fix in - if no 2 above is to complex, then can we maybe get a more random ID as per 1?

We have implemented no2 and it's merged.
However we haven't yet managed to go and update all the templates. Plus you had a concern about having the unique string as a property in the template. So I'm not sure when/if/how we'll proceed.

@marrobi
Copy link
Member

marrobi commented Aug 10, 2023

This is coming up again as an issue. I think will look at a temporary fix just to check the last 4 digits is unique, then longer term when time allows, we can go through and update all the templates.

What was the conversation about the ID being a property? A property of the workspace service?

@marrobi
Copy link
Member

marrobi commented Feb 29, 2024

@SvenAelterman I think we need a short term workaround for this, I have tested something like:

    async def is_workspace_with_last_4_id(self, workspace_id: str) -> bool:
        from azure.mgmt.storage import StorageManagementClient

        storage_client = StorageManagementClient(credentials.get_credential(), config.SUBSCRIPTION_ID)
        # check for storage account with last 4 digits of workspace_id
        availability_result = storage_client.storage_accounts.check_name_availability(
            {
                "name": f"stgws{workspace_id[-4:]}"
            }
        )
        return not availability_result.name_available

Which works, but as per the discussion above we need to have a more robust solution.

Which could be as you suggested in #3862, use the tre_id in names:

stgws<tre_id>ws<ws_short_id>

But we also need to adjust templates to use outputs of parent templates, such as the workspace storage account name, rather than a hardcoded format. There is also the airlock to consider as per #3243.

I propose a workaround, then we schedule getting this sorted for good, which will be a breaking change on the base workspace and child templates.

@tim-p-allen
Copy link
Collaborator

PR #3863

@marrobi
Copy link
Member

marrobi commented May 29, 2024

Had a chat with people about this and the view is using unique IDs for the full name - all characters, and then tags for the descriptions might be the best approach. If we use the TRE ID then we are going to start running out of characters.

@marrobi
Copy link
Member

marrobi commented Jan 23, 2025

As per #3243 the plan was for a workspace to have a unique_identifier_suffix property.

@TonyWildish-BH
Copy link
Contributor

Bringing over my comments from #4298:

There are a number of tickets related to scalability and naming of things like resource groups or storage accounts, all of which boil down to the fact that relying on naming conventions for resources limits the namespace and causes collisions, which limits scalability. E.g., #3921, #2893, #3666, #3862, #3243, #1073.

The fixes proposed for these issues have all relied on increasing randomness in the suffix, appending unique ids which then get passed around etc. I think this is not the best approach going forwards for two main reasons:

  1. using deterministic names inevitably means reduced namespace for resources, since there are fewer characters left to carry the randomness, which will lead (has led) to collisions. Some proposals include embedding the tre_id in the name, but that's not adequate either, it assumes we'll all have unique tre_id's, whereas it's likely we're all calling them sde or tre.
  2. using deterministic names which are calculated in several places in the code leads to brittle code. Changing the calculated name becomes very difficult as every place where it's calculated has to be found and changed.
    A better approach would be to calculate the name once, store it somewhere (e.g. terraform state, cosmos DB, ...), then reference it from there. Where the value gets stored will depend on the downstream uses for it, but the principle should be to have a SSOT wherever possible.

This can make the code more complex in some ways, of course, since looking something up rather than calculating it is more work, but it's essential to fix this problem or we can't scale much beyond where we are today.

Some of the fixes proposed rely on the assumption that the tre_id is globally unique, but simply stating that assumption doesn't make it so. That puts the burden on whoever creates a TRE to pick a unique ID, with no knowledge of what IDs other people have used. The best strategy then is to pick a 12-digit random string, which could be done for us by the code.

I think the discussion in most of the tickets don't get to the core issue, which is the brittle code that results from calculated names. Better management of naming metadata is needed, i.e. a single source of truth for everything and appropriate lookup mechanisms when the name is required.

@tamirkamara
Copy link
Collaborator

@TonyWildish-BH As you saw, over time we devised a few ways to increase the randomness of those strings to avoid a full rework that will span across the entire system due to the complexity and time investment needed. How we devise the random name is the small part of this - how we make the entire system aware and able to reach different resources (like storage accounts) is the bigger challenge.

As pointed out by Marcus, #3243 is a start, along side with the ability of the system that is already there to get "parent" resource properties is a big step forward. But due to other priorities we couldn't complete the implementation end to end.
In general, this relies on a workspace having a unique suffix that could later be used for anything we like. The challenge is to see that all things line up and every part of the system can use it.

@TonyWildish-BH
Copy link
Contributor

thanks for the comments, @tamirkamara. Personally, I think the time is right for a rethink of the entire system. There are a number of users who are starting to scale up activities, and any refactoring done now will save a lot of time, effort, and complexity later on.

Instead of calculating resource names according to more complex formulae that bring in their own restrictions (e.g. tre_id length and uniqueness), why not just pass around the resource-IDs (UUIDs) of all the resources required for an operation, and pull the information out of the terraform state for those resources. That should be straightforward, and I'm willing to bet that will simplify the code significantly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working has workaround a workaround is available for this issue
Projects
10 participants