Page MenuHomePhabricator

Dagster Pandas Guide Docs
ClosedPublic

Authored by themissinghlink on Feb 6 2020, 7:20 PM.

Details

Summary

This revision adds a guide for our dagster pandas integration. It also adds some examples which the docs links to. I don't do grammar good, so I would love feedback!

Test Plan

unit

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
schrockn added inline comments.Feb 6 2020, 9:02 PM
docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
44–53

This is about the underlying code, but StrictColumnsConstraint would seem to be better expressed with a StrictColumnsConstraint with an Any type. Also specifying types on PandasColumn is really common and I think we should have a top level dtype argument that constructs the ColumnTypeConstraint for you

86

doesn't need a description. description not exclusively for dagit too. generalized thing

90

let's only render the subset of this file relevant to what is being discussed

96–112

ahhhh... definitely lead with this. the super verbose example will turn off people.

SugarTripDataFrame = create_dagster_pandas_dataframe_type(
    name='SugarTripDataFrame',
    columns=[
        PandasColumn.integer_column('bike_id', min_value=0),
        PandasColumn.categorical_column('color', categories={'red', 'green', 'blue'}),
        PandasColumn.datetime_column('start_time', min_datetime=NOW),
        PandasColumn.datetime_column('end_time', min_datetime=NOW),
    ],
)

^-- this should be the first example IMO

This revision now requires changes to proceed.Feb 6 2020, 9:02 PM
schrockn added inline comments.Feb 6 2020, 9:03 PM
docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
44–53

Errr mistype.

StrictColumnsConstraint

could be

columns=[PandasColumn.exists('foo')]

112

Also confusing to introduce categorical as bespoke and then have a static method. users will be unable to do that without modifying the core library

max added inline comments.Feb 6 2020, 9:10 PM
docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
9

i think "extends the type system" is slightly the wrong framing -- what we've done here is written some custom types, i think the second sentence here and the following paragraph could be collapsed into one.

17

s/aside from/in addition to -- these are the standard fields for any dagster type

23

consider using the sphinx emphasize-lines directive, or rendering only some of the file

26

where is the dataset on which we're running this pipeline?

44

we should document these constraints and autogen api docs for them, then link here with the :py:class: directive

44–53

agree

69

hard to keep track of the terms of art here -- type constrants == schema validation, shape constraints == shape validation?

77

consider "custom" 🇬🇧

96

would it be crazy to lead with this?

docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
96–112

I am afraid that this would totally confuse people. I was trying to go for a piecemeal step by step document. Should I organize it this way:

Intro, schema validation, sugar, dataframe constraints, eventmetadata, custom validation.

I want to first introduce people to the concept of the PandasColumn before throwing compositions at them.

schrockn added inline comments.Feb 6 2020, 9:18 PM
docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
96–112

I would just use the sugar in a section called schema validation and lead with it. Trust me it is *way* easier to understand and, I would guess, will be the API that users will want to use 80-90% of the time.

columns=[
    PandasColumn.integer_column('bike_id'),
    PandasColumn.categorical_column('color',),
    PandasColumn.datetime_column('start_time'),
    PandasColumn.datetime_column('end_time'),
],
docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
112

AH that makes sense. Ok I will do that! Will let you know when it's ready. Thanks for the feedback, this is super helpful.

schrockn added inline comments.Feb 6 2020, 9:50 PM
docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
112

ya and don't do the categorical column since it is "custom"

themissinghlink marked 10 inline comments as done.
  • added simple tests for guide just to ensure they dont break on code changes
  • draft 2 of the documentation
  • made black fix

Ready for another round!

max added inline comments.Feb 10 2020, 10:38 PM
docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
17

I think it's confusing to introduce an API and then say there are a bunch of convenience constructors. Just progressively disclose the APIs as they appear in the examples -- I read this and then looked at the example and now I don't know what is the sugar and what is the underlying API.

20

this is the first example, not the next example

24

this name isn't meaningful to the user -- the pipeline doesn't seem to have anything to do with sugar

24

helpful to provide :linenos:

27

follow

32

maybe point to docs instead -- do we have docstrings on these static methods, and are they exposed in api docs? if so, link using sphinx's :py:class: machinery

91

i think these line numbers are off

examples/dagster_examples/dagster_pandas_guide/sugar_pipeline.py
17 ↗(On Diff #9416)

this is a little confusing, why would i want this constraint? (vs. max_datetime=NOW, which makes sense)

20 ↗(On Diff #9416)

i know it's a little late for this comment, but exists() is hard for me -- i immediately was like, is this exists or is this non_null, and went to check the code. consider (aliasing this with) any()

26 ↗(On Diff #9416)

This feels like a somewhat contrived setup and greatly increases the amount of space we need on the docs pages to display these examples -- vs. just defining the data type and then perhaps introducing some data on which to test it. i'd prefer a setup where only lines 12-23 were displayed in the docs. if you want to exercise this in a pipeline, perhaps a solid that read data in from csv or json and then emitted the data type would be the way to go -- this is more realistic and focuses attention on the parts that matter

schrockn added inline comments.Feb 10 2020, 11:26 PM
docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
17

👍🏻

24

To further reiterate. I believe that these APIs can and should be the 80% case. E.g. 80% of the time these will be sufficient to describe dataframe. They should be considered part of the core API

examples/dagster_examples/dagster_pandas_guide/sugar_pipeline.py
20 ↗(On Diff #9416)

any_column ?

schrockn requested changes to this revision.Feb 11 2020, 12:37 AM

q mgmt for max's feedback

This revision now requires changes to proceed.Feb 11 2020, 12:37 AM
  • implemented feedback and moved example data into csv

it's ready to rock!

schrockn requested changes to this revision.Feb 11 2020, 9:38 PM

almost there. thanks for working through this!

docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
18

"acts as containers" is a bit weird.

Suggestion:

provide a list of`PandasColumn` objects which specified column-level schema and constraints.

19

Cut "However, we provide a ton of convenience constructors."

42

"dataframe-level constraints"

They can be anything, not just "shape"

50–52

This example is still weird IMO because StrictColumnsConstraint is superfluous. The columns already do that.

80–81

The first three sentences could read as something like:

PandasColumn is user-pluggable with custom constraints. They can be constructed directly and passed a list of ColumnConstraints.

82

is it ColumnLevelConstraint or ColumnConstraint?

83

para graph when starting to talk about the example?

85

let's cut (These e-bikes are made of solid gold)

86

an error_description

91

All the columns that are not `

PandasColumn(
    'amount_paid', constraints=[ColumnTypeConstraint('int64'), DivisibleByFiveConstraint()]
),

`

are just noise. I would recommend eliminating them (and the dataframe constraints) and then highlighting the pandas column

94–95

Eliminate this sentence

This revision now requires changes to proceed.Feb 11 2020, 9:38 PM

Will address the others! Quick question on the example.

docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
50–52

Ha, you touched on a weird issue actually. Internally the API is totally checking if the columns exist twice. However, this also ensures extra columns aren't there wheras columns don't do that. However, I should totally be more explicit, I just mentioned the row example. Would being explicit about the strictness of that constraint be satisfactory?

prha added a comment.Feb 11 2020, 10:00 PM

Could just be me, but it feels jarring to switch so often between passive voice, 1st person and 2nd person.

docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
17

To create a custom dagster pandas type, use

19–21

Strike To further illustrate this idea....

For example, we can construct a custom dataframe type to represent a set of e-bike trips in the following way:

and show only the snippet for the type definition.

schrockn added inline comments.Feb 11 2020, 10:04 PM
docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
52

maybe let's hold off showcasing it until we figure out an api that doesn't seemingly duplicate information like this

docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
52

Aight. So just to be explicit, delete the whole section until we figure out a better API for dataframe constraints.

themissinghlink added a comment.EditedFeb 11 2020, 10:20 PM

Great points @prha. That was an artifact of the multiple rewrites, it def needed to be buttoned up. On it....Next time I should start this on google docs.....this is an awful forum for iterating on docs when they are in a naecent stage.

docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
19–21

Will def make it more succinct! However, I want to make sure we showcase how we hook those types into solids because there is some complexity there in that you can only hook them into input/output definitions and not as mypy compatible types.

schrockn added inline comments.Feb 11 2020, 10:20 PM
docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
52

No just omit the StrictColumnsConstraint per in-person convo

prha added inline comments.Feb 11 2020, 10:26 PM
docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
19–21

maybe show the two parts separately, then...

e.g.

For example, we can construct a custom dataframe type to represent a set of e-bike trips in the following way: ...

Once our custom data type is defined, we can use it as the type declaration for the inputs / outputs of our solid: ...

  • revised examples and doc to incorporate feedback
schrockn accepted this revision.Feb 12 2020, 12:09 AM

Great! Please heed final comments, especially on the first sentence (some of @max 's feedback was unaddressed) Great job! Fun to see this documented

docs/sections/learn/guides/dagster_pandas/dagster_pandas.rst
7–8

Per max's feedback, extending the dagster type system isn't really accurate.

Pandas is a broadly adopted library for data transformations. Dagster pandas is a library the provides the ability to easily express custom dataframe types that perform data validation" and so on and so forth

21

there's an unnecessary trailing comma on the first line

51

there's another unnecessary trailing comma in this code include

This revision is now accepted and ready to land.Feb 12 2020, 12:09 AM
  • added last batch of fixes to the documentation
This revision was automatically updated to reflect the committed changes.