
Different scenarios to consider when doing input/output management

Authored by schrockn on Nov 30 2020, 2:31 PM.



Scratched these together. Not totally done as TJ decided to get up earlier than is customary.

But here are four use cases that I think we should keep in mind when designing this:

  1. Homogeneous data lake. The "lakehouse" scenario. Greenfield, clean graph. Typing is consistent and correct. Want to minimize the number of places in the graph where we have to encode coordinate data.
  1. Heterogeneous data lake scenario. This describes a setup where a single graph goes from a data lake (e.g. pyspark) to a data warehouse (dbt over snowflake) to another data lake (pandas). This is conceivable in a multi-team scenario.
  1. Config DSL. An engineer has set up a system where she allows users to modify the behavior of the pipeline via config only. The scenario here is that non-technical users can add things like filter operations in the input section of the config.
  1. Operational flexibility. An engineer has inherited a legacy pipeline that they can only debug in a deployed state. They *might* have a staging environment. However, they cannot execute locally on their laptop, nor can they ssh into the remote machine. Therefore, what they want to do is continuously redeploy code and parameterize with the config system to do their dev loop.
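To make the common thread concrete, here is a minimal sketch of an input-resolution policy that covers both ends of this spectrum. Everything here is hypothetical and illustrative, not the actual dagster API: `InputContext`, `config_override`, and `upstream_output` are names I made up for the sketch.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class InputContext:
    """Hypothetical context handed to an input loader (illustrative only)."""
    upstream_output: Optional[Any] = None   # value produced in-graph
    config_override: Optional[dict] = None  # e.g. {"external_s3_path": "..."}


def load_input(context: InputContext) -> Any:
    """Resolve a solid input, preferring an explicit config override.

    - Operational flexibility (case 4) and config DSL (case 3): a config
      override such as an external_s3_path wins, so a redeployed pipeline
      can be re-pointed without code changes.
    - Homogeneous/heterogeneous lake (cases 1-2): with no override, fall
      through to whatever the upstream solid produced.
    """
    if context.config_override is not None:
        # A real system would load from the external location; for the
        # sketch we just hand back the override itself.
        return context.config_override
    return context.upstream_output


# Cases 1-2: no override, use the in-graph value.
assert load_input(InputContext(upstream_output=[1, 2, 3])) == [1, 2, 3]

# Case 4: an operator re-points the input purely via config.
override = {"external_s3_path": "s3://bucket/debug-snapshot.parquet"}
assert load_input(InputContext(upstream_output=[1, 2, 3],
                               config_override=override)) == override
```

The design point is just that one resolution function can serve all four cases; where the override lives (run config, pipeline config, a DSL) is the open question.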
Test Plan


Diff Detail

R1 dagster
Lint Passed
No Test Coverage

Event Timeline

These are awesome. I'm working on rebasing this on top of my InputManager diff to see how cleanly they fit.

One thing to think about: In the operational flexibility case, it's likely that the same file is used by multiple solids in the pipeline. At least ideally, when overriding the external_s3_path, they'd only need to do it in one place and have that carry over to everywhere it's used.
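One way to get that "override in one place" behavior is to hang the path off a single shared resource rather than per-solid config. A rough sketch, with made-up names (`PathResource`, `solid_a`, `solid_b` are illustrative, not dagster APIs):

```python
class PathResource:
    """One shared holder for the externally-overridable path."""

    def __init__(self, external_s3_path: str) -> None:
        self.external_s3_path = external_s3_path


def solid_a(paths: PathResource) -> str:
    # Both solids read the path from the shared resource, so neither
    # hard-codes the location.
    return f"a read {paths.external_s3_path}"


def solid_b(paths: PathResource) -> str:
    return f"b read {paths.external_s3_path}"


# Overriding external_s3_path in one place carries to every consumer.
shared = PathResource("s3://bucket/override.parquet")
assert "override.parquet" in solid_a(shared)
assert "override.parquet" in solid_b(shared)
```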