Page MenuHomePhabricator

make dagster pandas input/output schema more flexible
AcceptedPublic

Authored by themissinghlink on Feb 17 2020, 7:21 PM.

Details

Reviewers
max
nate
Summary

Currently, the input/output selector schemas are super strict and only allow for configuring sep, index params. I would like to turn more knobs. Also, this coupling is not great because if pandas makes a backwards incompatible change to their read_csv/to_csv API, our builds break. This pushes the work to client code so that our API isn't couple too tightly to pandas.

Test Plan

unit

Diff Detail

Repository
R1 dagster
Branch
add-io-configs-to-gcs (branched from master)
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

themissinghlink edited the summary of this revision. (Show Details)
max added a comment.Feb 17 2020, 11:19 PM

i would have biased towards going in the other direction and strongly typing the whole pandas API rather than reducing the surface we cover. one benefit of strongly-typed config is catching errors early. even if we do go the other direction, i don't like the introduction of a new parameter name that doesn't align with the underlying API -- why serialization_options and not kwargs? (i admit that taking this position to its extreme would replace path with filepath_or_buffer, which, maybe, is the wrong thing to do)

The problem with strongly typing the config is that while yes you get to catch errors early, you also create a really brittle API. Pandas API changes pretty frequently and new kwargs are added all the time. What happens if the user uses a version of rreadcsv that differs from the version we made a strongly typed config of? Or worse, what if a version of pandas makes a backward incompatible change? Our current version is the least opinionated about implementation details and if the user has a pin to a specific version of pandas and want to be opinionated, they can make their own. Right now I feel more people would make their own rather than use the default which feels wrong.

Re naming. I'm not married to serialization_options, however, kwargs seems a bit confusing right? Here is an example. What about read_csv_kwargs and to_csv_kwargs just so we are being hella explicit with what's happening here?

'df_as_config': {
    'inputs': {
        'df': {
            'csv': {
                'path': script_relative_path('num_table.txt'),
                'kwargs': {'sep': str('\t')},
            }
        }
    }
}

Regardless of whether pandas' API changes in the future, we need to be compatible with all versions of pandas which makes it difficult to have a strictly typed schema right? I mean it's a tradeoff, but I don't know if we want to tie ourselves to maintaining it.

  • renamed to more explicit val
  • rebasing to pick up buildkite changes.

Hey all wanna get resolution on this. What do we wanna do here? @nate ?

themissinghlink added a comment.EditedWed, Mar 4, 10:48 PM

FWIW I am fine with moving towards something more structured if the need arises. I would just really like something because this current surface area is too restrictive to be useful.

max added a comment.Tue, Mar 10, 7:11 PM

I don't think a Permissive approach is a good idea. instead of explicitly breaking when the API changes, we'll be in a place where configs have an implicit dependency on the installed version of pandas

max accepted this revision.Fri, Mar 20, 8:58 PM

I don't want to block this :)

This revision is now accepted and ready to land.Fri, Mar 20, 8:58 PM