
[composition] improve output node misuse errors

Authored by alangenfeld on Oct 23 2019, 4:14 PM.



From a situation hit during ... "user research": make InvokedSolidOutputHandle an object that overrides methods to provide good error messages, so that it does not accidentally destructure into two strings.
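A minimal sketch of the idea (not the actual dagster implementation — the class body and error wording here are illustrative): make the handle a plain object and override the misuse hooks, so accidental destructuring raises a helpful error instead of silently yielding two strings.

```python
class InvokedSolidOutputHandle:
    def __init__(self, solid_name, output_name):
        self.solid_name = solid_name
        self.output_name = output_name

    def __iter__(self):
        # Tuple unpacking (`a, b = handle`) goes through __iter__, so overriding
        # it turns the accidental destructure into a descriptive error.
        raise TypeError(
            f"Attempted to iterate or destructure the output of solid "
            f"'{self.solid_name}'. To return multiple outputs from a solid, "
            "declare multiple OutputDefinitions and yield an Output for each."
        )

    def __getitem__(self, key):
        raise TypeError(
            f"Attempted to index into the output of solid '{self.solid_name}' "
            f"with {key!r}; a single-output solid's result cannot be subscripted."
        )
```

With this in place, `x_train, x_test = some_solid()` fails immediately with an actionable message instead of handing two strings to the DAG-tracking machinery.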

Test Plan

new test cases

Diff Detail

R1 dagster
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

alangenfeld created this revision.Oct 23 2019, 4:14 PM

Hmmmm... I get the use case of avoiding the accidental destructuring of tuples, but I wonder if we are also imposing too much rigidity as a result. I foresee many use cases (specifically in ML) where a common pattern is to return a tuple and have it destructured so that pieces can be passed into different parts of the control flow. Here is an example. Let's say I have a solid that downloads a dataset and throws it into a dataframe. In the simplest use case, I want to split that dataframe into a test set and a training set. These functions generally look like the following:

# fancy_sk_learn_function stands in for something like
# sklearn.model_selection.train_test_split
from sklearn.model_selection import train_test_split as fancy_sk_learn_function

def split_test_train(training_set):
    X, y = training_set[FEATURE_COLUMNS], training_set[LABEL_COLUMN]
    x_train, x_test, y_train, y_test = fancy_sk_learn_function(X, y)
    return x_train, x_test, y_train, y_test

The training sets (x_train, y_train) are then fed into a training function whereas the test sets (x_test, y_test) are thrown into a model evaluation function. This gets even more complicated when you further split out your training set into k randomized cross validation datasets. I don't have a very clear mental model for what solids look like when we remove the ability to destructure or at the very least __getitem__.

Implementation-wise, this looks legit!

alangenfeld added a comment.EditedOct 29 2019, 2:58 PM

a common pattern is to return a tuple and have it destructured so that pieces can be passed into different parts of the control flow

So this doesn't prevent that - it just catches an error where someone could be trying to do it incorrectly. The complexity here comes from how "composition" functions work (`@pipeline`, `@composite_solid`), where we use a Python function to capture the DAG, vs. regular solids, where the body of the function does the compute. This change only affects composition functions, not regular solids.

So before this diff, this would cause an error:

@solid
def fancy_sk_learn_function(_) -> tuple:
    # do stuff
    return (x_train, x_test)

@pipeline
def split_test_train():
    # In composition functions, solid invocations return an InvokedSolidOutputHandle,
    # which we track and use to wire up the DAG. Since the return value happened to
    # be a tuple, the destructure would work, but our tracking would get strings
    # instead of InvokedSolidOutputHandle and throw a confusing error.
    x_train, x_test = fancy_sk_learn_function(training_set)

The fix, to be able to send different pieces of information through the pipeline, is to use multiple outputs: yield an `Output` for each piece and align the `output_defs` accordingly.

To correct:

@solid(output_defs=[OutputDefinition(Train, 'x_train'), OutputDefinition(Test, 'x_test')])
def fancy_sk_learn_function(_):
    # do stuff
    yield Output(x_train, 'x_train')
    yield Output(x_test, 'x_test')

@pipeline
def split_test_train():
    # All gravy - we return a namedtuple of InvokedSolidOutputHandles when a solid
    # is defined to have multiple outputs.
    x_train, x_test = fancy_sk_learn_function(training_set)

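The namedtuple-of-handles behavior described above can be sketched in plain Python; `output_handles` below is a hypothetical helper for illustration, not dagster API, and the string handles stand in for InvokedSolidOutputHandle objects.

```python
from collections import namedtuple

def output_handles(solid_name, output_names):
    # Hypothetical sketch: build a namedtuple whose fields are the solid's
    # declared output names, so destructuring in a composition function is
    # intentional and each piece remains a distinct, tracked handle rather
    # than an accidental string from unpacking a tuple return value.
    Handles = namedtuple("SolidOutputs", output_names)
    return Handles(*(f"{solid_name}.{name}" for name in output_names))
```

Because the fields are named, callers can destructure positionally (`x_train, x_test = ...`) or access outputs by name (`result.x_test`), and either way the framework knows exactly which output feeds which downstream solid.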

Got it that makes sense! That should definitely work. Thanks for taking the time to explain!

This revision is now accepted and ready to land.Oct 29 2019, 3:40 PM
nate added a subscriber: nate.Oct 29 2019, 5:44 PM
nate added inline comments.

"Consider yielding multiple outputs"? to make it clear we're talking about the Output object?

This revision was automatically updated to reflect the committed changes.