
Initial pass at telemetry

Authored by catherinewu on Wed, Mar 11, 5:58 PM.



Initial pass at telemetry. TODO: probably rename user_id => instance_id; add guardrail to prevent pytest from generating logs

Test Plan

Added some tests

Diff Detail

R1 dagster
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

catherinewu marked an inline comment as done.Fri, Mar 13, 7:04 AM
catherinewu added inline comments.

fixed, thx


i was only tracking when the execute pipeline api ended and not when the cli command ended, but it makes sense to track the end for both. consolidated


i might be misunderstanding 'top level cli command', it's currently at dagster utils reset_telemetry_profile but happy to move elsewhere!


the utility for this throws an error when DAGSTER_HOME is not set -- I can modify that function to take in an optional bool to suppress the error and return None but that feels slightly messier? or i can add another function that does _dagster_home_if_set() instead? not sure


I think we do want to set user_id and enabled when telemetry_enabled is None?

catherinewu marked an inline comment as done.


catherinewu added inline comments.Fri, Mar 13, 8:17 AM

agreed, in progress...


agreed, in progress...


oh cool that's a lot better

alangenfeld added inline comments.Fri, Mar 13, 3:53 PM

i guess i would propose something like dagster utils enable_telemetry --reset

async request, cli

catherinewu marked an inline comment as done.

revert async stuff

write logs to $DAGSTER_HOME/log.txt instead of POST-ing immediately, update tests, add threading to handle file upload async when dagit is started up (with 3 retries), add redirect from =>, TODO: add more testing

oops, add month/day/year back into logs

update endpoint

nate added inline comments.Mon, Mar 16, 6:41 PM

I'm not sure if / what's needed here to clean up thread when serve_forever() is interrupted, @alangenfeld?


probably need os.path.expanduser here


it looks like start_server kicks this off in a thread, but this will only try to run once and then terminate. Should this be more of a while True and then ship logs up periodically? e.g. if dagit is a long-running server and stays running for weeks
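The suggested `while True` shape might look like this — a sketch assuming a hypothetical one-shot `upload_once` callable, with a daemon thread so the loop dies with the server and a stop event so the interval is testable:

```python
import threading

UPLOAD_INTERVAL_SECONDS = 60 * 60  # hypothetical: ship logs hourly

def upload_logs_loop(upload_once, interval=UPLOAD_INTERVAL_SECONDS, stop_event=None):
    # Run the one-shot uploader forever, sleeping between attempts,
    # so a long-running dagit server keeps shipping logs for weeks.
    stop_event = stop_event or threading.Event()
    while not stop_event.is_set():
        try:
            upload_once()
        except Exception:
            # Swallow errors: telemetry must never take down the server.
            pass
        stop_event.wait(interval)

def start_upload_thread(upload_once):
    # daemon=True so the thread exits when the main process does.
    thread = threading.Thread(
        target=upload_logs_loop, args=(upload_once,), daemon=True
    )
    thread.start()
    return thread
```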


maybe also put this URL in a DAGSTER_TELEMETRY_URL at the top of the file?


hmmm will this be a problem if the main process is continuing to write to the log file while this thread is uploading/removing?


I think it'd be nice to put this in a function at the very top of the file and document it, to make it super clear what's being logged

catherinewu marked 2 inline comments as done.


catherinewu marked an inline comment as done.Mon, Mar 16, 6:59 PM
catherinewu added inline comments.

my impression was that the daemon thread will die if serve_forever dies? I'm not sure if that's sufficient


ah i see, i can have it loop every 8 hours? i should also add logging for execute pipeline commands from the dagit ui


I'm not sure how robust we should make this right now, but one option is to move the file to log-process-upload.txt before the upload occurs so that logs that happen in the meantime can be written to log.txt?

alangenfeld added inline comments.Mon, Mar 16, 7:01 PM

since its a daemon thread it should die when the process exits


send log file every hour or when logs > 10MB, send log file directly (without parsing), use two files to improve robustness (log.txt and log-queue.txt)

nate added a comment.Tue, Mar 17, 12:21 AM

ok left a few comments. a couple of high level ones:

  • It's worth adding docstrings to most of the functions to explain what they do, since our users are likely to closely inspect this file to understand our telemetry
  • I would try to add some tests covering the cases where the log files grow super large and ensure we handle appropriately; e.g. what happens if the dagster telemetry server is unavailable and the log grows to 2GB? Generally want to avoid the case where our logs become extremely large
  • Instead of only one staging file, we may want to do a logrotate style chunking of logs to avoid the failure case where we accidentally accumulate huge single files for upload; worst case we should have a large number of smaller files that need to be uploaded, and eventually we should handle ageing out old files (e.g. we generated a log file 2 months ago, uploads never succeeded, it's still on disk, let's just delete it)
  • Right now we only invoke upload_logs() from a thread in Dagit; might want to do the same for the dagster CLI. For example a user may never use Dagit and only use dagster pipeline run in which case they will accumulate a huge log file
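The ageing-out idea in the third bullet could be a small sweep like this — a sketch with a hypothetical `prune_stale_logs` name and a hypothetical two-month cutoff:

```python
import os
import time

MAX_LOG_AGE_SECONDS = 60 * 60 * 24 * 60  # hypothetical: ~two months

def prune_stale_logs(log_dir, max_age=MAX_LOG_AGE_SECONDS, now=None):
    # Delete rotated log chunks whose uploads never succeeded and
    # that are older than the cutoff, so failed uploads cannot
    # accumulate on disk indefinitely.
    now = now if now is not None else time.time()
    removed = []
    for name in os.listdir(log_dir):
        if not name.startswith('log'):
            continue
        path = os.path.join(log_dir, name)
        if now - os.path.getmtime(path) > max_age:
            os.remove(path)
            removed.append(name)
    return removed
```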

I'd break out the inside of this elif into a function somehow and write tests against it, I worry about py2/3 compat with all the file handling here. May want to open all binary and use six.ensure_binary liberally
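One way to make the file handling py2/3-safe, per the comment above: open everything in binary mode and normalize text to bytes at the write site. The helper below mirrors what `six.ensure_binary` does, defined locally so the sketch has no dependency:

```python
def ensure_binary(s, encoding='utf-8'):
    # Mirrors six.ensure_binary: pass bytes through, encode text.
    if isinstance(s, bytes):
        return s
    return s.encode(encoding)

def append_log_line(path, line):
    # Open in binary append mode so behavior is identical on py2/py3,
    # regardless of whether callers hand us str or bytes.
    with open(path, 'ab') as f:
        f.write(ensure_binary(line) + b'\n')
```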


file reserved keyword


per discussion can we gzip encode to save network bandwidth? our events should compress nicely
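The gzip step is small with the stdlib; a sketch of compressing the payload before the POST (the `Content-Encoding: gzip` header and `DAGSTER_TELEMETRY_URL` constant are the ones suggested elsewhere in this review):

```python
import gzip
import io

def gzip_payload(data):
    # Compress the raw log bytes before POST-ing; repetitive
    # telemetry events should compress very well.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
        gz.write(data)
    return buf.getvalue()

# The upload would then send the compressed body, e.g.:
# requests.post(DAGSTER_TELEMETRY_URL, data=gzip_payload(raw),
#               headers={'Content-Encoding': 'gzip'})
```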


'w' here, 'a' above - if this is to just touch the file should keep consistent

catherinewu marked 3 inline comments as done.Tue, Mar 17, 8:23 PM
catherinewu added inline comments.

this actually deletes the contents of the file (whereas the 'a' touches the file), but i removed the whole thing thx to the new logger :D
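The 'a' vs 'w' distinction being discussed, in two lines each:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'log.txt')

with open(path, 'w') as f:
    f.write('existing contents')

# 'a' touches the file without destroying what is there...
with open(path, 'a'):
    pass
assert open(path).read() == 'existing contents'

# ...while 'w' truncates it the moment it is opened.
with open(path, 'w'):
    pass
assert open(path).read() == ''
```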

fix test, add event id

nate added inline comments.Wed, Mar 18, 12:27 AM

maybe add a sleep to this while loop?


hmmm this last_run logic seems off given how you define last_run above -- maybe I'm missing something?


and I didn't quite follow what in_progress does here?



nate added a comment.Wed, Mar 18, 12:28 AM

cool I think we're close! just a few last small comments, and also one high-level one - it's worth doing a find-and-replace pass over everything to replace double quotes with single quotes just to keep consistent w/ the rest of the codebase

catherinewu marked 2 inline comments as done.Wed, Mar 18, 12:40 AM

adding a note that uploading on dagster pipeline run is a bit more complicated to support because

  1. we would need a locking mechanism in case dagit and dagster cli are trying to upload simultaneously
  2. dagster cli processes will tend to be more short lived

seems worth it eventually though


good call -- will do


I initialize last_run as a datetime.datetime object 2 hours in the past to make sure that _upload_logs runs the first time this if statement is checked (datetime.datetime.now() - (datetime.datetime.now() - datetime.timedelta(minutes=120)) > datetime.timedelta(minutes=60) == True)

Then, last_run is just updated per run and the if statement checks that 60 minutes has passed since the last run. does that sound right?


in_progress is just a sanity check to make sure that we never kick off two concurrent _upload_logs jobs. although would two dagit instances share the same $DAGSTER_HOME... that seems possibly bad
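Putting the last_run seeding and the in_progress guard together, the scheduling state described in these two comments can be sketched as a small class (the name `UploadScheduler` is hypothetical; the 120-minute seed and 60-minute interval are the ones discussed):

```python
import datetime

UPLOAD_INTERVAL = datetime.timedelta(minutes=60)

class UploadScheduler(object):
    # Seed last_run 2 hours in the past so the very first check
    # fires, and use in_progress so two concurrent _upload_logs
    # jobs can never overlap.
    def __init__(self):
        self.last_run = datetime.datetime.now() - datetime.timedelta(minutes=120)
        self.in_progress = False

    def should_upload(self, now=None):
        now = now if now is not None else datetime.datetime.now()
        return not self.in_progress and now - self.last_run > UPLOAD_INTERVAL

    def mark_started(self, now=None):
        self.in_progress = True
        self.last_run = now if now is not None else datetime.datetime.now()

    def mark_finished(self):
        self.in_progress = False
```

Note this only guards against overlap within one process; two dagit instances sharing a $DAGSTER_HOME would still race, as the comment above worries.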

catherinewu marked an inline comment as done.


nate accepted this revision.Wed, Mar 18, 2:58 AM
This revision is now accepted and ready to land.Wed, Mar 18, 2:58 AM
catherinewu marked 8 inline comments as done.Wed, Mar 18, 5:10 AM
This revision was automatically updated to reflect the committed changes.