Page MenuHomePhabricator

Add pandas version agnostic type checking constraint.
ClosedPublic

Authored by themissinghlink on Apr 2 2020, 10:49 PM.

Details

Summary

It was expressed that ColumnTypeConstraints need to support pandas 1.0.0 dtypes. While I could totally just keep adding dtypes to a never ending set of expected dtypes, I think it makes sense to migrate all of this to the pandas core API. However, this doesn't play well with the ColumnGeneralTypeConstraint and also the core API doesn't support datetime types so we still need to keep the original constraint.

This revision introduces a ColumnGeneralTypeConstraint which performs pandas version agnostic type checking so we don't have to worry about dtype version support.

Issues: https://github.com/dagster-io/dagster/issues/2317, https://github.com/dagster-io/dagster/issues/2344, https://github.com/dagster-io/dagster/issues/2338

Test Plan

unit

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

  • went from Dtypes to types
themissinghlink edited the summary of this revision. (Show Details)Apr 2 2020, 11:32 PM
themissinghlink added a reviewer: alangenfeld.
themissinghlink edited the summary of this revision. (Show Details)
themissinghlink edited the summary of this revision. (Show Details)Apr 2 2020, 11:37 PM
themissinghlink added a reviewer: max.

lets take some time to think this one through since it appears doing it in a backwards compat safe and not confusing way may be tricky

python_modules/libraries/dagster-pandas/dagster_pandas/constraints.py
141–142

hmm ColumnGeneralTypeConstraint vs ColumnTypeConstraint seems too close - something seems off here

alangenfeld added inline comments.Apr 2 2020, 11:51 PM
python_modules/libraries/dagster-pandas/dagster_pandas/constraints.py
153–154

did you consider an approach which maps to the is_XXX_dtype function for the relevant set of expected_pandas_dtypes strings here?

max added inline comments.Apr 3 2020, 12:45 AM
python_modules/libraries/dagster-pandas/dagster_pandas/constraints.py
141–142

yeah i feel like this could be collapsed into one constraint type since expected_pandas_dtypes is already Union[str, Set[str]], we would just expand the set of allowable strs for expcted_pandas_dtypes

142–146

is it a good idea to keep this open vs. constraining the possible string values here

163

It would be nice to know what dtype we observed as well as what we were expecting

python_modules/libraries/dagster-pandas/dagster_pandas/constraints.py
141–142

So I can totally expand the set of allowable strs for expected pandas dtypes however, every time pandas adds a dtype or deprecates one, we are going to have to update our code. Using is_XXX_dtype ensures we aren't coupling our code with pandas' understanding of dtypes. The only reason I am keeping this constraint in is because of datetime types which isn't yet translated by pandas.

142–146

Yeah I can setup an enum for expected dtypes, but want to first figure the other stuff out first.

153–154

I did but the whole point of using is_XXX_dtype is so that we can push the responsibility of translating dtypes and python types to pandas. If we start mapping, then that advantage is lost.

alangenfeld added inline comments.Apr 3 2020, 3:43 PM
python_modules/libraries/dagster-pandas/dagster_pandas/constraints.py
153–154

just to clarify - i meant if we get "string" here we would map to evoking is_string_type instead of the current style of check - but dont know the pandas dtype stuff well enough to know if thats bonkers or not

python_modules/libraries/dagster-pandas/dagster_pandas/constraints.py
153–154

Ah I see. So the issue is there are overlapping terms of art. Pandas also defines a "string" as a dtype. I fear trying to mix all of these different "type" terms into one constraint might get overly confusing. This is why I went with a "dumb" constraint which lets the user configure what dtypes they expect. And a "smart" constraint which lets the user just say, "look I want a column of string-like data, I don't want to deal with figuring out all of the possible dtypes that could be".

alangenfeld added inline comments.Apr 3 2020, 7:16 PM
python_modules/libraries/dagster-pandas/dagster_pandas/constraints.py
153–154

hmm how bout something like

ColumnTypeFnConstraint(is_string_df) # takes Callable[[str], bool]

and

ColumnTypeInSetConstraint(expected_dtypes) # takes Set[str]

( keep ColumnTypeConstraint til 0.8.0 for backcompat )

python_modules/libraries/dagster-pandas/dagster_pandas/constraints.py
153–154

Sure that is fine. Should we indicate via comment that we plan on deprecating this? Do we have a depreciation strategy? Is it overkill?

  • went from Dtypes to types
  • fix changes
  • added new constraints and fixed tests
  • added tests and switched to new constraints.
Harbormaster failed remote builds in B9220: Diff 11402!
  • fixed tests by getting rid of typechecconstraint
alangenfeld added inline comments.Apr 3 2020, 10:38 PM
python_modules/libraries/dagster-pandas/dagster_pandas/constraints.py
153–154

I go back and forth on if ColumnTypeInSetConstraint is worth it to do the deprecation dance or if we should just stick with ColumnTypeConstraint

should we be verbose and do ColumnTypeFunctionConstraint?

curious for other opinions

157

this error message seems rough it would be - is_string_dtype must evaluate to True for column dtypes: int ? seems like we should include the word "received"

164–193

these error message should also be more clear as to what was received as per maxs point

python_modules/libraries/dagster-pandas/dagster_pandas/constraints.py
153–154

I’m fine with doing the depreciation dance. But this lib is super new and this is a less frequently used lower level API which is why I am also okay with being a little loose and deleting it.

RE verbose, what about ‘ColumnPandasDTypeFunctionConstraint‘ ?

That makes it clear, we can also add a comment?

alangenfeld added inline comments.Apr 3 2020, 11:24 PM
python_modules/libraries/dagster-pandas/dagster_pandas/constraints.py
153–154

ya since the public APIs are mostly the PandasColumn static methods it might fine to just rename in place since its unlikely to hit many / any people

ColumnPandasDTypeFunctionConstraint

I think pandas is probably redundant since this in a pandas library but DTypeFunction and DTypeInSet are about as clear as possible

max added inline comments.Apr 5 2020, 4:14 AM
python_modules/libraries/dagster-pandas/dagster_pandas/constraints.py
142

my only real issue with all of this is that it's undocumented

153

received

173

received

  • got rid of ColumnTypeConstraint, renamed constraints to be more clear, and cleaned up dagit rendering
  • cleaned up error messages
max accepted this revision.Apr 7 2020, 9:48 PM

I don't want to block this on docstrings, but I really wish we had docstrings.

This revision is now accepted and ready to land.Apr 7 2020, 9:48 PM
themissinghlink added a comment.EditedApr 7 2020, 9:53 PM

@max Is it okay if I address that in a follow up diff? For posterity I am tracking this in an issue which I should get out soon. https://github.com/dagster-io/dagster/issues/2353