Page MenuHomePhabricator

make dagster pandas input/output schema more flexible

Authored by themissinghlink on Feb 17 2020, 7:21 PM.



Currently, the input/output selector schemas are super strict and only allow for configuring sep, index params. I would like to turn more knobs. Also, this coupling is not great because if pandas makes a backwards incompatible change to their read_csv/to_csv API, our builds break. This pushes the work to client code so that our API isn't couple too tightly to pandas.

Test Plan


Diff Detail

R1 dagster
add-io-configs-to-gcs (branched from master)
Lint OK
No Unit Test Coverage

Event Timeline

themissinghlink edited the summary of this revision. (Show Details)
max added a comment.Feb 17 2020, 11:19 PM

i would have biased towards going in the other direction and strongly typing the whole pandas API rather than reducing the surface we cover. one benefit of strongly-typed config is catching errors early. even if we do go the other direction, i don't like the introduction of a new parameter name that doesn't align with the underlying API -- why serialization_options and not kwargs? (i admit that taking this position to its extreme would replace path with filepath_or_buffer, which, maybe, is the wrong thing to do)

The problem with strongly typing the config is that while yes you get to catch errors early, you also create a really brittle API. Pandas API changes pretty frequently and new kwargs are added all the time. What happens if the user uses a version of rreadcsv that differs from the version we made a strongly typed config of? Or worse, what if a version of pandas makes a backward incompatible change? Our current version is the least opinionated about implementation details and if the user has a pin to a specific version of pandas and want to be opinionated, they can make their own. Right now I feel more people would make their own rather than use the default which feels wrong.

Re naming. I'm not married to serialization_options, however, kwargs seems a bit confusing right? Here is an example. What about read_csv_kwargs and to_csv_kwargs just so we are being hella explicit with what's happening here?

'df_as_config': {
    'inputs': {
        'df': {
            'csv': {
                'path': script_relative_path('num_table.txt'),
                'kwargs': {'sep': str('\t')},

Regardless of whether pandas' API changes in the future, we need to be compatible with all versions of pandas which makes it difficult to have a strictly typed schema right? I mean it's a tradeoff, but I don't know if we want to tie ourselves to maintaining it.

  • renamed to more explicit val
  • rebasing to pick up buildkite changes.

Hey all wanna get resolution on this. What do we wanna do here? @nate ?

themissinghlink added a comment.EditedMar 4 2020, 10:48 PM

FWIW I am fine with moving towards something more structured if the need arises. I would just really like something because this current surface area is too restrictive to be useful.

max added a comment.Mar 10 2020, 7:11 PM

I don't think a Permissive approach is a good idea. instead of explicitly breaking when the API changes, we'll be in a place where configs have an implicit dependency on the installed version of pandas

max accepted this revision.Mar 20 2020, 8:58 PM

I don't want to block this :)

This revision is now accepted and ready to land.Mar 20 2020, 8:58 PM

nah, I thought about this and don't want to explode API surface area until we get a request from a user. Let's collect some data and revisit at some point.