Initial pass at telemetry. TODO: probably rename user_id => instance_id; add a guardrail to prevent pytest from generating logs
I was only tracking when the execute pipeline API ended, not when the CLI command ended, but it makes sense to track the end for both, so I consolidated them.
I might be misunderstanding 'top-level CLI command'; it's currently at dagster utils reset_telemetry_profile, but happy to move it elsewhere!
The utility for this throws an error when DAGSTER_HOME is not set. I could modify that function to take an optional bool that suppresses the error and returns None, but that feels slightly messy. Or I could add another function that does _dagster_home_if_set() instead? Not sure which is better.
I think we do want to set user_id and enabled when telemetry_enabled is None?
Write logs to $DAGSTER_HOME/log.txt instead of POST-ing immediately; update tests; start a thread when dagit starts up to upload the log file asynchronously (with 3 retries); add a redirect from telemetry.dagster.io to telemetry.elementl.dev. TODO: add more testing
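For reference, the log-to-file side could look roughly like this. This is a sketch only: the function name, event shape, and file name here are illustrative, not the actual implementation.

```python
import json
import os
import tempfile


def write_telemetry_log(event, dagster_home):
    # Append a JSON-encoded telemetry event as one line of
    # $DAGSTER_HOME/log.txt, to be uploaded later by a background thread.
    log_path = os.path.join(dagster_home, "log.txt")
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")


# Usage: events accumulate locally; a daemon thread POSTs them later.
home = tempfile.mkdtemp()  # stand-in for $DAGSTER_HOME
write_telemetry_log({"action": "execute_pipeline_started"}, home)
```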
I'm not sure whether (or what) cleanup is needed here for the thread when serve_forever() is interrupted; @alangenfeld?
probably need os.path.expanduser here
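i.e. something like this, so that a DAGSTER_HOME value like "~/dagster_home" resolves to an absolute path (the path below is just an example):

```python
import os

# expanduser turns a leading "~" into the user's home directory;
# without it, open() would treat "~" as a literal directory name.
path = os.path.expanduser("~/dagster_home/log.txt")
```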
it looks like start_server kicks this off in a thread, but this will only try to run once and then terminate. Should this be more of a while True loop that ships logs up periodically? e.g. if dagit is a long-running server and stays running for weeks
maybe also put this URL in a DAGSTER_TELEMETRY_URL constant at the top of the file?
hmmm will this be a problem if the main process is continuing to write to the log file while this thread is uploading/removing?
I think it'd be nice to put this in a function at the very top of the file and document it, to make it super clear what's being logged
my impression was that the daemon thread will die if serve_forever dies? I'm not sure if that's sufficient
ah I see, I can have it loop every 8 hours? I should also add logging for execute pipeline commands from the dagit UI
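the looping upload could be sketched like this: a daemon thread that wakes up frequently but only uploads once per interval. Names (upload_loop, tick_seconds) are made up for illustration; the interval is the 8 hours from the discussion.

```python
import datetime
import threading
import time


def upload_loop(upload_fn, interval_hours=8, stop_event=None, tick_seconds=1):
    # Periodically invoke upload_fn; exits when stop_event is set.
    # Short sleeps keep a daemon thread responsive to shutdown.
    # Seed last_run one full interval in the past so the first check fires.
    last_run = datetime.datetime.now() - datetime.timedelta(hours=interval_hours)
    while stop_event is None or not stop_event.is_set():
        if datetime.datetime.now() - last_run >= datetime.timedelta(hours=interval_hours):
            upload_fn()
            last_run = datetime.datetime.now()
        time.sleep(tick_seconds)


# Usage: daemon=True so the thread dies with the main process.
calls = []
stop = threading.Event()
t = threading.Thread(
    target=upload_loop,
    args=(lambda: calls.append(1),),
    kwargs={"stop_event": stop, "tick_seconds": 0.01},
    daemon=True,
)
t.start()
time.sleep(0.1)  # the first upload fires immediately; the next is 8h away
stop.set()
t.join()
```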
I'm not sure how robust we should make this right now, but one option is to move the file to log-process-upload.txt before the upload occurs, so that logs written in the meantime can go to log.txt?
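that rename-then-upload idea might look something like this (the helper name is made up; the file names match the comment above):

```python
import os
import tempfile


def stage_logs_for_upload(dagster_home):
    # Atomically move log.txt aside so the main process can keep
    # appending new events to a fresh log.txt during the upload.
    # Returns the staged path, or None if there is nothing to upload.
    src = os.path.join(dagster_home, "log.txt")
    dst = os.path.join(dagster_home, "log-process-upload.txt")
    if not os.path.exists(src):
        return None
    # rename is atomic on POSIX within one filesystem; on Windows it
    # fails if dst already exists, which would need handling.
    os.rename(src, dst)
    return dst


home = tempfile.mkdtemp()  # stand-in for $DAGSTER_HOME
with open(os.path.join(home, "log.txt"), "w") as f:
    f.write("event\n")
staged = stage_logs_for_upload(home)
```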
ok left a few comments. a couple of high level ones:
- It's worth adding docstrings to most of the functions in telemetry.py to explain what they do, since our users are likely to closely inspect this file to understand our telemetry
- I would try to add some tests covering the cases where the log files grow super large and ensure we handle them appropriately; e.g. what happens if the dagster telemetry server is unavailable and the log grows to 2GB? Generally we want to avoid the case where our logs become extremely large
- Instead of only one staging file, we may want to do logrotate-style chunking of logs to avoid the failure case where we accidentally accumulate huge single files for upload. Worst case we'd have a large number of smaller files that need to be uploaded, and eventually we should handle ageing out old files (e.g. we generated a log file 2 months ago, uploads never succeeded, it's still on disk; let's just delete it)
- Right now we only invoke upload_logs() from a thread in Dagit; we might want to do the same for the dagster CLI. For example, a user may never use Dagit and only use dagster pipeline run, in which case they will accumulate a huge log file
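The chunking and age-out ideas above could be sketched like this. All names and thresholds here are placeholders (1MB rotation cap, 2-month retention) chosen for illustration:

```python
import os
import tempfile
import time

MAX_LOG_BYTES = 1024 * 1024  # rotate at ~1MB; real threshold TBD
MAX_AGE_SECONDS = 60 * 60 * 24 * 60  # drop chunks older than ~2 months


def rotate_if_needed(log_path):
    # Rename log.txt to a timestamped chunk once it grows past the cap,
    # so no single upload file gets huge. Returns the chunk path or None.
    if os.path.exists(log_path) and os.path.getsize(log_path) >= MAX_LOG_BYTES:
        chunk = "%s.%d" % (log_path, int(time.time()))
        os.rename(log_path, chunk)
        return chunk
    return None


def age_out_old_chunks(dagster_home, now=None):
    # Delete rotated chunks whose embedded timestamp is too old,
    # covering the "uploads never succeeded" failure case.
    now = now if now is not None else time.time()
    removed = []
    for name in os.listdir(dagster_home):
        if name.startswith("log.txt."):
            ts = int(name.rsplit(".", 1)[1])
            if now - ts > MAX_AGE_SECONDS:
                os.remove(os.path.join(dagster_home, name))
                removed.append(name)
    return removed


# Usage sketch:
home = tempfile.mkdtemp()  # stand-in for $DAGSTER_HOME
log_path = os.path.join(home, "log.txt")
with open(log_path, "w") as f:
    f.write("x" * MAX_LOG_BYTES)
chunk = rotate_if_needed(log_path)

open(os.path.join(home, "log.txt.100"), "w").close()  # ancient chunk
removed = age_out_old_chunks(home)
```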
I'd break the inside of this elif out into a function somehow and write tests against it; I worry about py2/3 compat with all the file handling here. May want to open everything in binary mode and use six.ensure_binary liberally
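e.g. something along these lines. The sketch below uses a plain .encode; the actual PR would presumably use six.ensure_binary in its place so the same code works on py2 and py3 (function names here are made up):

```python
import json
import os
import tempfile


def encode_event(event):
    # Serialize a telemetry event to bytes for binary-mode file IO.
    # The real code would use six.ensure_binary here for py2/3 compat.
    return (json.dumps(event) + "\n").encode("utf-8")


def append_event(log_path, event):
    # Open in binary mode so py2 and py3 behave identically
    # (no implicit text decoding or newline translation).
    with open(log_path, "ab") as f:
        f.write(encode_event(event))


home = tempfile.mkdtemp()
path = os.path.join(home, "log.txt")
append_event(path, {"action": "test"})
```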
file shadows the py2 built-in; rename this variable
per discussion, can we gzip-encode to save network bandwidth? our events should compress nicely
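the compression side could be as simple as this sketch; the request would then go out with a Content-Encoding: gzip header. gzip.compress is py3-only, so the py2/3 version would wrap a GzipFile around a BytesIO instead (function name here is illustrative):

```python
import gzip
import json


def gzip_payload(events):
    # Gzip-compress a JSON batch of telemetry events before upload.
    # Repetitive event records compress very well.
    raw = json.dumps(events).encode("utf-8")
    return gzip.compress(raw)


payload = gzip_payload([{"action": "execute_pipeline"}] * 100)
```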
'w' here, 'a' above; if this is just to touch the file, keep it consistent
cool, I think we're close! just a few last small comments, and also one high-level one: it's worth doing a find-and-replace pass over everything to replace double quotes with single quotes, just to keep consistent w/ the rest of the codebase
adding a note that uploading on dagster pipeline run is a bit more complicated to support because:
- we would need a locking mechanism in case dagit and dagster cli are trying to upload simultaneously
- dagster cli processes will tend to be more short-lived
seems worth it eventually though
good call -- will do
I initialize last_run as a datetime.datetime two hours in the past to make sure that _upload_logs runs the first time this if statement is checked: datetime.datetime.now() - last_run is then 120 minutes, which is greater than the datetime.timedelta(minutes=60) threshold.
Then last_run is just updated per run, and the if statement checks that 60 minutes have passed since the last run. Does that sound right?
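i.e. the check described above, as a sketch (the helper name is made up; the interval is the 60 minutes from the discussion):

```python
import datetime

UPLOAD_INTERVAL = datetime.timedelta(minutes=60)


def should_upload(last_run, now=None):
    # True once at least an hour has passed since the last upload.
    now = now if now is not None else datetime.datetime.now()
    return now - last_run > UPLOAD_INTERVAL


# Seeding last_run 2 hours in the past guarantees the very first check fires:
last_run = datetime.datetime.now() - datetime.timedelta(minutes=120)
first = should_upload(last_run)
```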
in_progress is just a sanity check to make sure that we never kick off two concurrent _upload_logs jobs. Although, would two dagit instances share the same $DAGSTER_HOME? That seems possibly bad.