
parallel-workload: 0dt: thread 'coordinator' panicked at src/storage-controller/src/lib.rs:703:17: dependency since has advanced past dependent (u417) upper #28885

Open
Tracked by #27406
def- opened this issue Aug 8, 2024 · 6 comments · May be fixed by #29107
Labels: C-bug (Category: something is broken), ci-flake, T-platform-v2 (Theme: Platform v2)

Comments

def- (Contributor) commented Aug 8, 2024

What version of Materialize are you using?

736303f (Pull Request #28873)

What is the issue?

A serious version of #28634, hit in Parallel Workload (0dt deploy).

thread 'coordinator' panicked at /var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/storage-controller/src/lib.rs:703:17:
dependency since has advanced past dependent (u417) upper 

                            dependent (u417): upper Antichain { elements: [1723129772001] } 

                            dependency since Antichain { elements: [] }
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at ./rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/panicking.rs:72:14
   2: <mz_storage_controller::Controller<mz_repr::timestamp::Timestamp> as mz_storage_client::controller::StorageController>::create_collections_for_bootstrap::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/storage-controller/src/lib.rs:703:17
   3: <tracing::instrument::Instrumented<<mz_storage_controller::Controller<mz_repr::timestamp::Timestamp> as mz_storage_client::controller::StorageController>::create_collections_for_bootstrap::{closure#0}::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   4: <mz_storage_controller::Controller<mz_repr::timestamp::Timestamp> as mz_storage_client::controller::StorageController>::create_collections_for_bootstrap::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/storage-controller/src/lib.rs:491:5
   5: <core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = core::result::Result<(), mz_storage_types::controller::StorageError<mz_repr::timestamp::Timestamp>>>>> as core::future::future::Future>::poll
             at ./rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/future/future.rs:123:9
   6: <mz_adapter::coord::Coordinator>::bootstrap_storage_collections::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:2449:14
   7: <tracing::instrument::Instrumented<<mz_adapter::coord::Coordinator>::bootstrap_storage_collections::{closure#0}::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   8: <mz_adapter::coord::Coordinator>::bootstrap_storage_collections::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:2346:5
   9: <mz_adapter::coord::Coordinator>::bootstrap::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:1824:14
  10: <tracing::instrument::Instrumented<<mz_adapter::coord::Coordinator>::bootstrap::{closure#0}::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  11: <mz_adapter::coord::Coordinator>::bootstrap::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:1727:5
  12: mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:3728:26
  13: <tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  14: <tokio::runtime::park::CachedParkThread>::block_on::<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/park.rs:281:63
  15: tokio::runtime::coop::with_budget::<core::task::poll::Poll<core::result::Result<(), mz_adapter::error::AdapterError>>, <tokio::runtime::park::CachedParkThread>::block_on<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>::{closure#0}>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:107:5
  16: tokio::runtime::coop::budget::<core::task::poll::Poll<core::result::Result<(), mz_adapter::error::AdapterError>>, <tokio::runtime::park::CachedParkThread>::block_on<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>::{closure#0}>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:73:5
  17: <tokio::runtime::park::CachedParkThread>::block_on::<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/park.rs:281:31
  18: <tokio::runtime::context::blocking::BlockingRegionGuard>::block_on::<tracing::instrument::Instrumented<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/blocking.rs:66:9
  19: <tokio::runtime::handle::Handle>::block_on::<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>::{closure#0}
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/handle.rs:310:22
  20: tokio::runtime::context::runtime::enter_runtime::<<tokio::runtime::handle::Handle>::block_on<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>::{closure#0}, core::result::Result<(), mz_adapter::error::AdapterError>>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/runtime.rs:65:16
  21: <tokio::runtime::handle::Handle>::block_on::<mz_adapter::coord::serve::{closure#0}::{closure#2}::{closure#0}>
             at ./cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/handle.rs:309:9
  22: mz_adapter::coord::serve::{closure#0}::{closure#2}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-032de4cf07f9f85e9-1/materialize/test/src/adapter/src/coord.rs:3721:33
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

ci-regexp: dependency since has advanced past dependent

def- added the C-bug (Category: something is broken) and T-platform-v2 (Theme: Platform v2) labels on Aug 8, 2024
aljoscha added a commit to aljoscha/materialize that referenced this issue Aug 13, 2024
def- (Contributor, Author) commented Aug 14, 2024

We can use CI Failures to see when this occurs again with more context. So far nothing.

aljoscha (Contributor) commented Aug 16, 2024

Aha! We have one: https://buildkite.com/materialize/nightly/builds/9115#01915669-0a3f-438b-be63-2bc0b7b1b8bd

The relevant log snippets from that one are:

parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:25.976061Z  INFO coord::handle_message{kind="purified_statement_ready"}:sequence_create_source:coord::catalog_transact_with_side_effects:coord::catalog_transact_inner:catalog::transact:catalog::transact_inner:transact_op: mz_adapter::catalog::transact: create source db-pw-1723733000-0.s-14.kafka_table57 (u319)

parallel-workload-materialized2-1    | environmentd: 2024-08-15T15:07:33.881371Z  INFO environmentd::run:environmentd::serve:adapter::serve:coord::coordinator: mz_adapter::catalog::open: startup: controller init: beginning
parallel-workload-materialized2-1    | environmentd: 2024-08-15T15:07:33.886867Z  INFO environmentd::run:environmentd::serve:adapter::serve:coord::coordinator:controller::new: mz_controller: starting controllers in read-only mode!

parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:35.907644Z  INFO coord::handle_message{kind="command-execute"}:message_command:coord::handle_execute{session="3136aea1-fc34-4a14-8076-1cd95dc84fd1"}:coord::handle_execute_inner{stmt="DROP SOURCE \"db-pw-1723733000-0\".\"s-14\".kafka_table57"}:sequence_drop_objects:coord::catalog_transact:coord::catalog_transact_conn:coord::catalog_transact_inner:catalog::transact:catalog::transact_inner:transact_op: mz_adapter::catalog::transact: drop source db-pw-1723733000-0.s-14.kafka_table57 (u319)
parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:36.167253Z  INFO mz_storage_client::storage_collections: removing collection state because the since advanced to []! id=u319
parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:36.176158Z  INFO mz_storage_client::storage_collections: removing persist handles because the since advanced to []! id=u319
parallel-workload-materialized-1     | environmentd: 2024-08-15T15:07:36.176173Z  INFO mz_storage_client::storage_collections: enqueing shard finalization due to dropped collection and dropped persist handle id=u319 dropped_shard_id=s0e63c9cf-f200-48f0-9a77-5541f63a175f

parallel-workload-materialized2-1    | environmentd: 2024-08-15T15:07:37.477783Z  INFO mz_service::grpc: GrpcClient /tmp/3964e4004e68513c22fd04b478d92a227a0a66b5: connected
parallel-workload-materialized2-1    | thread 'coordinator' panicked at /var/lib/buildkite-agent/builds/buildkite-builders-aarch64-585fc7f-i-085e762fefce6d863-1/materialize/nightly/src/storage-controller/src/lib.rs:699:17:
parallel-workload-materialized2-1    | dependency since has advanced past dependent (u319) upper 
parallel-workload-materialized2-1    | 
parallel-workload-materialized2-1    |                             dependent (u319): upper Antichain { elements: [1723734456001] } 
parallel-workload-materialized2-1    | 
parallel-workload-materialized2-1    |                             dependency since Antichain { elements: [] } 
parallel-workload-materialized2-1    | 
parallel-workload-materialized2-1    |                             dependency read holds: [ReadHold { id: User(319), since: Antichain { elements: [] }, .. }, ReadHold { id: User(318), since: Antichain { elements: [0] }, .. }]

edit: See diagnosis below!

So it looks like that shard/collection has been dropped and is in the process of being finalized, but then we also try to bootstrap it in the storage controller. In the controller, we can see from the read holds that the since is already the empty antichain, so we can't read from it anymore.

I'll try to get to the bottom of how this can happen, but I still think it's not an issue for now.
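
As a standalone illustration of the failing check (a minimal sketch, not the actual controller code; `u64` stands in for `mz_repr::Timestamp` purely for illustration):

```rust
// Minimal sketch (not the controller code): why a since of `[]` trips the
// `since <= upper` dependency check during bootstrap.
use timely::progress::Antichain;
use timely::PartialOrder;

fn main() {
    // The dependent collection (u319 in the logs above) still has a real upper.
    let dependent_upper = Antichain::from_elem(1_723_734_456_001u64);

    // After the old generation drops the source and starts shard finalization,
    // the dependency's since is advanced to the empty antichain `[]`, the
    // maximum frontier: no time is readable anymore.
    let dependency_since: Antichain<u64> = Antichain::new();

    // Frontier order: `a.less_equal(&b)` holds iff every element of `b` is
    // greater than or equal to some element of `a`. An empty `a` satisfies
    // this only against an empty `b`, so `since <= upper` fails here, which
    // is what the storage controller turns into a panic.
    assert!(!PartialOrder::less_equal(&dependency_since, &dependent_upper));
    assert!(PartialOrder::less_equal(&dependency_since, &Antichain::<u64>::new()));
}
```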

aljoscha (Contributor) commented
I believe I have this diagnosed! What happens is this:

  • materialized is the previous deploy generation
  • materialized2 is the new generation that we're upgrading to

I updated the log snippets above to lay out precisely what happens:

  1. materialized creates the (Kafka) source u319
  2. materialized2 comes up in read-only mode, reads the most recent catalog snapshot as of that time, and starts its controllers
  3. materialized drops the source u319 and initiates shard finalization, which advances the since to []
  4. materialized2 keeps its (now outdated) catalog state, which still contains u319, tries to initialize its storage controller state, and then notices that the since is incompatible with the upper of that source -> panic

Thoughts:

  • I don't think we can prevent this as long as we keep state in different places (here: the catalog shard and the frontiers of individual persist shards), because we can't update these atomically.
  • It's not a problem as long as we don't downgrade the since to [] while a collection is still needed according to catalog state. I'd say even if we did that it wouldn't be a problem: it is legal to have a collection with since=[], it just means no one can read from it.
  • I think the solution is to downgrade this assert (`mz_ore::soft_assert_or_log!`) to not check frontiers when the since is empty, because no one can observe changes to the shard anymore, and to also not run ingestions for collections whose since is empty. See the sketch after this list.
  • This ☝️, however, might lead to panics in other places. For example, a materialized view that builds on this source might now realize that the sinces of its dependencies (including that Kafka source, say) are now no good anymore. But again, that materialized view would already have been dropped in the latest catalog state, otherwise its (and its dependencies') sinces would not have been allowed to go to empty.
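
To make the third bullet concrete, here is a rough sketch of the kind of relaxation being proposed, assuming the check reduces to a since-vs-upper frontier comparison. The function name and parameters are hypothetical; this is not the actual code in src/storage-controller/src/lib.rs:

```rust
// Hypothetical sketch of the proposed relaxation; all names are placeholders.
use timely::progress::Antichain;
use timely::PartialOrder;

fn check_dependency_since<T: PartialOrder + std::fmt::Debug>(
    dependent_id: &str,
    dependent_upper: &Antichain<T>,
    dependency_since: &Antichain<T>,
) {
    // A dependency whose since has advanced to `[]` has been dropped and is
    // being finalized: nobody can read from it anymore, so changes to it are
    // unobservable and the invariant no longer needs to hold for it.
    if dependency_since.is_empty() {
        return;
    }

    // Otherwise the dependency's read frontier must not be beyond the
    // dependent collection's write frontier. (The actual controller would use
    // its own assertion machinery, e.g. `mz_ore::soft_assert_or_log!`, rather
    // than a bare assert.)
    assert!(
        PartialOrder::less_equal(dependency_since, dependent_upper),
        "dependency since has advanced past dependent ({dependent_id}) upper: \
         upper {dependent_upper:?}, since {dependency_since:?}"
    );
}
```

With the values from the panic above (dependent upper [1723734456001], dependency since []), this would skip the check instead of panicking.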

benesch (Member) commented Aug 16, 2024

> • This ☝️, however, might lead to panics in other places. For example, a materialized view that builds on this source might now realize that the sinces of its dependencies (including that Kafka source, say) are now no good anymore. But again, that materialized view would already have been dropped in the latest catalog state, otherwise its (and its dependencies') sinces would not have been allowed to go to empty.

Is this something we need to solve up front, or are you proposing that we downgrade the assert and then see if anything like this happens before trying to address it?

benesch (Member) commented Aug 16, 2024

> or are you proposing that we downgrade the assert and then see if anything like this happens before trying to address it?

(Because I'm very in favor of this.)

aljoscha (Contributor) commented
> or are you proposing that we downgrade the assert and then see if anything like this happens before trying to address it?
>
> (Because I'm very in favor of this.)

That's my preferred approach, yes!

aljoscha added commits to aljoscha/materialize referencing this issue on Aug 19 and Aug 27, 2024, with the commit message:

Fixes MaterializeInc#28885

With this change, we might now get panics in other places where we make
assumptions about sinces/uppers that don't hold in read-only mode or
when there are concurrent coordinator-like processes.
def- added the ci-flake label on Sep 18, 2024