Page MenuHomePhabricator

include system framing message in user errors
ClosedPublic

Authored by sandyryza on Jan 15 2021, 6:46 PM.

Details

Summary

Old:

2021-01-15 13:28:19 - dagster - ERROR - my_pipeline - d4216bdf-c543-451c-94d4-cd63255ace84 - 68005 - take_data - STEP_FAILURE - Execution of step "take_data" failed.

ValueError: abc

Stack Trace:
  File "/Users/sryza/dagster/python_modules/dagster/dagster/core/errors.py", line 182, in user_code_error_boundary
    yield
  File "/Users/sryza/dagster/python_modules/dagster/dagster/core/execution/plan/inputs.py", line 346, in _load_input_with_input_manager
    value = root_input_manager.load_input(context)
  File "/Users/sryza/dagster/python_modules/dagster/dagster_tests/core_tests/test_errors.py", line 92, in load_input
    raise ValueError("abc")

New

2021-01-15 13:28:19 - dagster - ERROR - my_pipeline - d4216bdf-c543-451c-94d4-cd63255ace84 - 68005 - take_data - STEP_FAILURE - Execution of step "take_data" failed.

Error occurred during the loading of input "input1" of step "take_data":

ValueError: abc

Stack Trace:
  File "/Users/sryza/dagster/python_modules/dagster/dagster/core/errors.py", line 182, in user_code_error_boundary
    yield
  File "/Users/sryza/dagster/python_modules/dagster/dagster/core/execution/plan/inputs.py", line 346, in _load_input_with_input_manager
    value = root_input_manager.load_input(context)
  File "/Users/sryza/dagster/python_modules/dagster/dagster_tests/core_tests/test_errors.py", line 92, in load_input
    raise ValueError("abc")
Test Plan

bk

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Jan 15 2021, 7:05 PM
Harbormaster failed remote builds in B24497: Diff 29831!
sandyryza retitled this revision from indent error framing to include system framing message in user errors.Jan 15 2021, 9:30 PM
sandyryza edited the summary of this revision. (Show Details)

what led away from the approach in https://dagster.phacility.com/D5784 ? Just updating test call sites?

I think just raising the outer error and improving our display / printing of it makes the most sense to me. Adding another field to StepFailureData could make sense too.

python_modules/dagster/dagster/core/execution/plan/execute_plan.py
247–270

messing with the serialization of the error this way doesn't sit well with me here. Also worth thinking about how this shows up in dagit, and what flexibility we have to do even better displays in the future depending on how we model this.

247–270

just spitballing - even though the stack trace for DagsterUserCodeExecutionError is not generally useful for users, I could imagine it being useful in a bug report submitted to us. In that line of thinking capturing it and just not printing it in stdout / hiding behind a click in dagit might be a better path

what led away from the approach in https://dagster.phacility.com/D5784 ? Just updating test call sites?
I think just raising the outer error and improving our display / printing of it makes the most sense to me.

I think it ultimately comes down to the comparison here: https://github.com/dagster-io/dagster/issues/3497#issuecomment-755784025. The approach in D5784 adds a big distracting section to user error messages that users won't care about in 99% of cases.

Do you have something in mind for how we could improve the display?

Adding another field to StepFailureData could make sense too.

I'll give that a shot.

Do you have something in mind for how we could improve the display?

I was thinking something like this (which also fixes the cause connecting message and adds the missing class name).

(dagenv38) ~/dagster:err$ git diff
diff --git a/python_modules/dagster/dagster/utils/error.py b/python_modules/dagster/dagster/utils/error.py
index edbc66d02..911c95f1b 100644
--- a/python_modules/dagster/dagster/utils/error.py
+++ b/python_modules/dagster/dagster/utils/error.py
@@ -13,14 +13,23 @@ def __new__(cls, message, stack, cls_name, cause=None):
         return super(SerializableErrorInfo, cls).__new__(cls, message, stack, cls_name, cause)

     def to_string(self):
-        stack_str = "\nStack Trace:\n" + "".join(self.stack) if self.stack else ""
-        cause_str = (
-            "\nThe above exception was the direct cause of the following exception:\n"
-            + self.cause.to_string()
-            if self.cause
-            else ""
-        )
-        return "{err.message}{stack}{cause}".format(err=self, stack=stack_str, cause=cause_str)
+        stack_str = ""
+        cause_str = ""
+
+        # Special case our wrapper exception around user code
+        if self.cls_name == "DagsterUserCodeExecutionError":
+            # omit stack and cause framing
+            cause_str = "\n\n" + self.cause.to_string() if self.cause else ""
+        else:
+            stack_str = "\nStack Trace:\n" + "".join(self.stack) if self.stack else ""
+            cause_str = (
+                "\nThe above exception was caused by the following exception:\n"
+                + self.cause.to_string()
+                if self.cause
+                else ""
+            )
+
+        return f"{self.cls_name}: {self.message}{stack_str}{cause_str}"


 def _serializable_error_info_from_tb(tb):

I really like the idea of changing [1] to throw the outer exception since that most accurately models what is happening in the code. Then handling the user display in the display code, in to_string and https://dagster.phacility.com/source/dagster/browse/master/js_modules/dagit/src/PythonErrorInfo.tsx$37. Mucking with the data model to get the display right here makes me uneasy.

python_modules/dagster/dagster/core/execution/plan/execute_plan.py
246–250

[1]

queue management - we spoke in slack about changing background to a full exception among other things

This revision now requires changes to proceed.Jan 25 2021, 11:01 PM

@alangenfeld here's a version that stores bth the full error and the user error on the StepFailureData. One thing to think about is that we're basically doubling the payload size of these events. Something to be concerned about?

I dont think the double persisting is that problematic - but I do think the idea of capturing the comment at [2] in an enum so that we can echo it in dagit is a powerful idea

  • RUNTIME_ERROR -> something wrong happened in framework land discovered only at runtime, usually solvable by user
  • USER_CODE_ERROR -> the outer error frames the user error found in cause, we can key off this in error_display_string and dagit
  • UNEXPECTED_ERROR -> not a dagster error, encourage user to file an issue since something went wrong in a bad way that we need to fix (or wrap)

I am happy to commandeer this if you want given the churn i have imposed. That said, I am happy with where the iteration has led us to.

python_modules/dagster/dagster/core/execution/plan/execute_plan.py
141–170

[2]

python_modules/dagster/dagster/core/execution/plan/objects.py
112

One thing to think about is that we're basically doubling the payload size of these events. Something to be concerned about?

Ideas

  • we could add an arg so that we don't persist the full chain for the "user error"
  • have some sort of scheme that allows us to know the top error in the chain is wrapper
    • outer_error_wraps_user_code_error: bool / outer_error_is_contextualizing: bool
    • dagster_error_type: DagsterErrorType(Enum) - enum aligns with [2]

but I do think the idea of capturing the comment at [2] in an enum so that we can echo it in dagit is a powerful idea

This seems like a good idea. Do you think it's necessary for the rest of this to go in?

This seems like a good idea. Do you think it's necessary for the rest of this to go in?

Actually, nevermind. I see how it's intertwined with what's going on here. Will give it a shot.

@alangenfeld I think this is ready for another look. Sorry for all the phabricator thrash. I ended up with a slightly different enum than the one you proposed. Let me know if you want to discuss.

awesome - super happy with where this ended up. One more round trip on the naming ErrorSetting since its going in the DB and would be costly to change. Request review if you have a case for why the current names are best.

python_modules/dagster/dagster/core/execution/plan/execute_plan.py
257–269

[1] lets maybe have the setting/context/source be explicitly passed in at each call site. Maybe leave comments here that interrupt & unexpected could be improved

python_modules/dagster/dagster/core/execution/plan/objects.py
60–65

we can iteratively add the other ones so I am fine starting small but [1]

61

Setting makes me think on / off type of setting. Maybe ErrorType, ErrorContext, DagsterErrorType, ErrorSource

65

I think USER_CODE_ERROR is more better - could imagine "USER_ACTION_ERROR" or something if users can take actions on in flight executions someday

96–97

check for cause here just to be safe in the if condition

This revision now requires changes to proceed.Mon, Feb 15, 6:35 PM
python_modules/dagster/dagster/core/execution/plan/objects.py
61

Yeah I feel you. Leary of "type", because errors already have types. Leary of "context" because most types in Dagster that end in Context are context objects that get passed in to user functions. I like source.

65

Definitely.

quite the journey but very excited about this - thanks for going through all the iteration.

Just take care of the one last inline before landing

finallydone

python_modules/dagster/dagster/core/execution/plan/objects.py
75

error_setting -> error_source

This revision is now accepted and ready to land.Wed, Feb 24, 8:04 PM

Woohoo! @alangenfeld thanks for your reviews through all the revs