
add summary table for asset keys
Closed · Public

Authored by prha on Tue, Oct 13, 4:41 PM.

Details

Summary

This is separate from the asset key structure, which we can maintain using the same schema.

Will start reading/writing to the new summary table.

Users will need to run a migration script in order for all historical asset keys to be visible in the index page.

Test Plan

back compat test

Diff Detail

Repository
R1 dagster
Branch
prha/asset_table
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

prha requested review of this revision. Tue, Oct 13, 5:13 PM
prha added reviewers: max, alangenfeld, schrockn.
python_modules/dagster/dagster/core/storage/event_log/schema.py
21

Hmm, I don't know about this one.

python_modules/libraries/dagster-postgres/dagster_postgres/event_log/event_log.py
102–106

There's gotta be a better way than this, right?

python_modules/dagster/dagster/core/storage/event_log/schema.py
21

Maybe a last-updated timestamp, but a count seems dubious to me.

python_modules/libraries/dagster-postgres/dagster_postgres/event_log/event_log.py
106
python_modules/dagster/dagster/core/storage/event_log/schema.py
21

We allow deleting event log records when we delete a run.

This is an attempt to make the bookkeeping accurate in case we delete the last record of a known asset key.

Short of maintaining a counter, we'd have to run a table-scanning distinct query upon run deletion.

The alternative is to not attempt that bookkeeping at all, and instead allow deleting asset keys from the summary table via the UI.
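The table-scanning alternative mentioned above might look something like this sketch, using SQLite and illustrative table/column names (not the actual dagster schema):

```python
import sqlite3

# Hypothetical schema: event_logs rows carry an asset_key; asset_keys is the
# summary table. Names are illustrative, not the actual dagster schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE event_logs (id INTEGER PRIMARY KEY, run_id TEXT, asset_key TEXT);
    CREATE TABLE asset_keys (asset_key TEXT PRIMARY KEY);
    """
)
conn.executemany(
    "INSERT INTO event_logs (run_id, asset_key) VALUES (?, ?)",
    [("run_1", "table_a"), ("run_1", "table_b"), ("run_2", "table_a")],
)
conn.executemany("INSERT INTO asset_keys VALUES (?)", [("table_a",), ("table_b",)])

def delete_run(conn, run_id):
    # Delete the run's event log rows, then reconcile the summary table with a
    # table-scanning DISTINCT query: drop any asset key with no remaining events.
    conn.execute("DELETE FROM event_logs WHERE run_id = ?", (run_id,))
    conn.execute(
        "DELETE FROM asset_keys WHERE asset_key NOT IN "
        "(SELECT DISTINCT asset_key FROM event_logs WHERE asset_key IS NOT NULL)"
    )

delete_run(conn, "run_1")
remaining = [r[0] for r in conn.execute("SELECT asset_key FROM asset_keys ORDER BY asset_key")]
# table_a still has an event from run_2; table_b had its last record deleted
```

The DISTINCT subquery is exactly the full scan being debated here, which is why a maintained counter (or accepting occasional stale keys) looks attractive by comparison.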

python_modules/libraries/dagster-postgres/dagster_postgres/event_log/event_log.py
106

I wasn't sure if we were going to try to support older versions of Postgres. Honestly, the fact that some syntax is supported by some versions of Postgres (> 9.5) and not others seems like a pain.
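The version-gated syntax in question is presumably the ON CONFLICT upsert clause. A minimal sketch, using SQLite (3.24+) as a stand-in since it accepts the same clause as Postgres 9.5+; the table name is illustrative, and on older Postgres you would check-then-insert or catch the unique violation instead:

```python
import sqlite3

# ON CONFLICT ... DO NOTHING makes re-inserting a known asset key a silent
# no-op instead of a unique-constraint error. Requires Postgres >= 9.5
# (or SQLite >= 3.24, used here as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE asset_keys (asset_key TEXT PRIMARY KEY)")

upsert = "INSERT INTO asset_keys (asset_key) VALUES (?) ON CONFLICT (asset_key) DO NOTHING"
conn.execute(upsert, ("my_asset",))
conn.execute(upsert, ("my_asset",))  # duplicate insert: silently ignored

(count,) = conn.execute("SELECT COUNT(*) FROM asset_keys").fetchone()
```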

prha planned changes to this revision. Wed, Oct 14, 3:46 PM

Your queue; re: conversations around the index opt-in.

This revision now requires changes to proceed. Thu, Oct 15, 8:50 PM

build upon D4800, use secondary index to track asset key table migration

prha planned changes to this revision. Fri, Oct 16, 4:45 PM

add cli argument to iterate through list of reindex migrations

python_modules/dagster/dagster/core/storage/event_log/migration.py
36–40

This took 30s on the RDS test db; are we worried about it on bigger instances?

36–52

We should print out what's happening here; potentially use tqdm or similar for progress on the asset key writes.

python_modules/dagster/dagster/core/storage/event_log/sql_event_log.py
331–333

did you check this query at all?

rebase, add more logging to migration script

python_modules/dagster/dagster/core/storage/event_log/sql_event_log.py
331–333

yep, it uses the asset_key_idx

python_modules/dagster/dagster/cli/instance.py
61 ↗(On Diff #24037)

cannot be migrated -- if you intended to migrate a persistent instance, etc.

python_modules/dagster/dagster/core/storage/event_log/base.py
68 ↗(On Diff #24037)

It would be helpful to have docs on the arguments so that people can implement this.

python_modules/dagster/dagster/core/storage/event_log/migration.py
36–40

There are other ways to do this that would make sure we avoid locking up the db. E.g.:

  • we could SELECT count(1) on the rows for which we want to insert asset_key entries (not attempting a uniqueness/group-by query), to get a count of rows to display to the user.
  • then we can page over the rows to index (ordering by id) and do batch upserts.

That will be computationally more expensive, but it moves the work into the migration process and out of the db.
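The count-then-page approach suggested above could be sketched roughly like this, using SQLite and illustrative names; a real implementation would run against the actual event log schema:

```python
import sqlite3

# Sketch: count the rows to migrate up front (for user-facing progress), then
# page over them ordered by id (keyset pagination) and upsert in batches, so
# the db never runs one big DISTINCT/GROUP BY. Schema names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE event_logs (id INTEGER PRIMARY KEY, asset_key TEXT);
    CREATE TABLE asset_keys (asset_key TEXT PRIMARY KEY);
    """
)
conn.executemany(
    "INSERT INTO event_logs (asset_key) VALUES (?)",
    [(f"asset_{i % 3}",) for i in range(10)],
)

# Cheap count of rows to process, displayed to the user before starting.
(total,) = conn.execute(
    "SELECT COUNT(1) FROM event_logs WHERE asset_key IS NOT NULL"
).fetchone()
print(f"{total} rows to index")

last_id, batch_size = 0, 4
while True:
    # Keyset pagination: resume strictly after the last seen id.
    batch = conn.execute(
        "SELECT id, asset_key FROM event_logs "
        "WHERE asset_key IS NOT NULL AND id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size),
    ).fetchall()
    if not batch:
        break
    # Batch upsert; duplicates across batches are absorbed by the primary key.
    conn.executemany(
        "INSERT OR IGNORE INTO asset_keys (asset_key) VALUES (?)",
        [(key,) for _, key in batch],
    )
    last_id = batch[-1][0]
    print(f"indexed through row {last_id}/{total}")

keys = sorted(r[0] for r in conn.execute("SELECT asset_key FROM asset_keys"))
```

The per-batch print is the cheap stand-in for the tqdm-style progress reporting requested elsewhere in this review.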

python_modules/dagster/dagster/core/storage/event_log/schema.py
21

I think we should have this on all our db objects, tbh.

python_modules/dagster/dagster/core/storage/event_log/sql_event_log.py
46

Hm, what happens if the migration fails halfway? We'd need to drop the index, right?

331–333

Would it be insane to have a test that makes assertions about the output of EXPLAIN on this query?
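A rough sketch of such a test, shown against SQLite's EXPLAIN QUERY PLAN (a Postgres version would parse plain EXPLAIN output instead); the schema and index name here are assumptions based on the asset_key_idx mentioned above:

```python
import sqlite3

# Test sketch: run EXPLAIN QUERY PLAN and assert the plan mentions the asset
# key index, guarding against regressions that fall back to a table scan.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE event_logs (id INTEGER PRIMARY KEY, asset_key TEXT);
    CREATE INDEX asset_key_idx ON event_logs (asset_key);
    """
)

# The last column of each plan row is the human-readable detail string.
plan = " ".join(
    row[-1]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM event_logs WHERE asset_key = ?", ("a",)
    )
)
uses_index = "asset_key_idx" in plan
```

Plan text is version-dependent on both SQLite and Postgres, so such a test should match loosely (index name present) rather than on the exact plan string.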

remove test for removed migration

python_modules/dagster/dagster/core/storage/event_log/migration.py
36–40

I think having a long-running read should not lock up the db (selects should only block alter/drop/truncate queries).

python_modules/dagster/dagster/core/storage/event_log/sql_event_log.py
46

No, we don't mark the migration as successful unless the script runs to completion.

I think this could stand to use more tests around the different behaviors and state transitions.

This revision now requires changes to proceed. Tue, Oct 20, 10:55 PM
python_modules/dagster/dagster/core/storage/event_log/migration.py
36–40

Consider the SELECT DISTINCT that locked up reads on dwall's db.

python_modules/dagster/dagster/core/storage/event_log/sql_event_log.py
46

but what state is the index left in? is there the possibility of duplicate entries?

python_modules/dagster/dagster/core/storage/event_log/migration.py
36–40

I think the select distinct had more to do with blocking on the graphql webserver, rather than at the DB level. The query itself should not block other reads.

python_modules/dagster/dagster/core/storage/event_log/sql_event_log.py
46

I can add a note about the importance of idempotency for the migrations. But the pattern of checking if the secondary index is enabled before reads should mean that it being in an incomplete state should not affect anything because there will be no reads from it.
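The guard-before-read pattern described here could be sketched like this; the class and method names are hypothetical stand-ins, not the actual storage API:

```python
# Sketch of the pattern: reads consult the summary table only once its
# secondary-index marker is set, so a half-finished migration is simply
# invisible. All names are hypothetical stand-ins for the real storage API.
class EventLogStorage:
    def __init__(self):
        self._built_indexes = set()   # stand-in for the secondary_indexes table
        self._asset_keys = set()      # stand-in for the asset_keys summary table
        self._event_log_keys = {"asset_a", "asset_b"}  # keys present in event log

    def has_secondary_index(self, name):
        return name in self._built_indexes

    def enable_secondary_index(self, name):
        # Only called after the migration script runs to completion.
        self._built_indexes.add(name)

    def all_asset_keys(self):
        if self.has_secondary_index("asset_key_table"):
            return sorted(self._asset_keys)      # fast path: summary table
        return sorted(self._event_log_keys)      # fallback: scan the event log

storage = EventLogStorage()
storage._asset_keys.add("asset_a")  # migration died halfway: only one key copied
before = storage.all_asset_keys()   # incomplete summary table is ignored

storage._asset_keys.add("asset_b")
storage.enable_secondary_index("asset_key_table")
after = storage.all_asset_keys()    # now served from the summary table
```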

python_modules/dagster/dagster/core/storage/event_log/migration.py
36–40

Ya, some notes on the webserver problem here: https://elementl.quip.com/zZtAA2HE2mFN/Dagit-Webserver-Throughput

python_modules/dagster/dagster/core/storage/event_log/migration.py
36–40

yep that makes sense

python_modules/dagster/dagster/core/storage/event_log/sql_event_log.py
46

So if I run it and it fails halfway, then run it again, will it drop the existing values? I don't see that here.

python_modules/dagster/dagster/core/storage/event_log/sql_event_log.py
46

It will not drop the existing values... But there is a unique constraint on asset_keys so you will not get duplicate values.

We also do the asset key table bookkeeping when we call delete_events on the event log storage, so I don't think you would end up with stale values in the asset key table that are missing from the event log table, short of manually dropping rows from the event_log.

Ya, I think the unique constraint gives us some comfortable wiggle room on corner cases. I will go ahead and green-light this, but follow up with @max on any last concerns.
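A small sketch of the wiggle room the unique constraint provides, assuming an illustrative asset_keys table: a rerun that re-inserts an already-known key either errors out or no-ops, but can never create a duplicate row:

```python
import sqlite3

# A unique constraint on asset_key makes reruns of the migration safe against
# duplicates: a plain re-INSERT raises, a tolerant re-INSERT no-ops.
# Table name is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE asset_keys (asset_key TEXT NOT NULL UNIQUE)")
conn.execute("INSERT INTO asset_keys VALUES ('asset_a')")  # first migration run

try:
    conn.execute("INSERT INTO asset_keys VALUES ('asset_a')")  # rerun, plain insert
    duplicated = True
except sqlite3.IntegrityError:
    duplicated = False  # constraint rejected the duplicate

conn.execute("INSERT OR IGNORE INTO asset_keys VALUES ('asset_a')")  # tolerant rerun
(count,) = conn.execute("SELECT COUNT(*) FROM asset_keys").fetchone()
```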

This revision is now accepted and ready to land. Wed, Oct 21, 7:44 PM
This revision was automatically updated to reflect the committed changes.