Scratched these together. Not totally done as TJ decided to get up earlier than is customary.
But here are four use cases that I think we should keep in mind when designing this:
- Homogeneous data lake. The "lakehouse" scenario. Greenfield, clean graph. Typing is consistent and correct. Want to minimize the number of places in the graph where we have to encode coordinate data.
- Heterogeneous data lake. This concocts a scenario where a single graph goes from a data lake (e.g. pyspark) to a data warehouse (dbt over snowflake) to another data lake (pandas). This is conceivable in a multi-team setup (see the first sketch after this list).
- Config DSL. An engineer has set up a system where she allows users to modify the behavior of the pipeline via config only. The scenario here is that non-technical users can add things like filter operations in the input section of the config (see the config sketch after this list).
- Operational flexibility. An engineer has inherited a legacy pipeline that they can only debug in a deployed state. They *might* have a staging environment, but they cannot execute locally on their laptop, nor do they (or perhaps cannot) SSH into their remote machine. Therefore their dev loop is to continuously redeploy code and parameterize behavior via the config system (see the last sketch after this list).
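
For the heterogeneous case, here's a rough sketch of what "one graph, three engines" might look like. The `Node` shape and the `engine` tag are made up purely for illustration, not a proposal for the actual API:

```python
# Illustrative only: one logical graph whose nodes declare which engine they
# run on. `Node`, `engine`, and the graph layout are all hypothetical.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Node:
    name: str
    engine: str               # "spark" | "dbt" | "pandas"
    fn: Callable              # the actual transform; elided here
    deps: list[str] = field(default_factory=list)

# A single graph crossing three runtimes. The question this forces on the
# design: where does each node's output live so the next engine can read it?
graph = [
    Node("raw_events",   engine="spark",  fn=lambda: ...),
    Node("daily_rollup", engine="dbt",    fn=lambda raw: ...,   deps=["raw_events"]),
    Node("report",       engine="pandas", fn=lambda daily: ..., deps=["daily_rollup"]),
]
```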
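For the config DSL case, a minimal sketch of a user-editable input section with filters, plus the small interpreter the framework would need. The config keys (`filters`, `op`, `value`) are my assumptions, not a settled schema:

```python
import pandas as pd

# Hypothetical config shape: non-technical users append filter operations to
# the `inputs` section without touching pipeline code.
config = {
    "inputs": {
        "events": {
            "path": "s3://bucket/events.parquet",
            "filters": [
                {"column": "country", "op": "==", "value": "US"},
                {"column": "amount", "op": ">", "value": 100},
            ],
        }
    }
}

# Map declarative operators to pandas Series comparisons.
_OPS = {
    "==": lambda s, v: s == v,
    ">":  lambda s, v: s > v,
    "<":  lambda s, v: s < v,
}

def apply_filters(df: pd.DataFrame, filters: list[dict]) -> pd.DataFrame:
    """Apply each declarative filter in order; unknown ops raise KeyError."""
    for f in filters:
        df = df[_OPS[f["op"]](df[f["column"]], f["value"])]
    return df
```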
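And for operational flexibility, the loop is basically redeploy, flip a knob, inspect artifacts. A sketch where `PIPELINE_DEBUG`, `sample_rows`, and `dump_after` are all hypothetical names:

```python
import json
import os

# Hypothetical debug knobs read from the environment; the engineer changes
# these between redeploys instead of attaching a debugger or SSHing in.
debug_cfg = json.loads(os.environ.get("PIPELINE_DEBUG", "{}"))

SAMPLE_ROWS = debug_cfg.get("sample_rows")          # e.g. 1000 to shrink inputs
DUMP_AFTER = set(debug_cfg.get("dump_after", []))   # node names to checkpoint

def maybe_sample(df):
    """Optionally cut the input down so a full deployed run finishes fast."""
    return df.head(SAMPLE_ROWS) if SAMPLE_ROWS else df

def maybe_dump(name, df):
    """Persist an intermediate output so it can be inspected after the run."""
    if name in DUMP_AFTER:
        df.to_parquet(f"/tmp/debug_{name}.parquet")
```

So e.g. setting `PIPELINE_DEBUG='{"sample_rows": 1000, "dump_after": ["daily_rollup"]}'` on the deployment would shrink the inputs and checkpoint one node per run, which is about as close to a local dev loop as this engineer can get.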