Make telemetry opt-out #715
Comments
Great write-up, and it's the decision I hoped for a very long time the team would come to! Excited to see this (hopefully) move forward.
Fully agree with @tynandebold - very happy to see this, and it's a great write-up 👍 Just to understand, the move from hashing username/project name to UUIDs is because it's even more anonymous? Since, in theory anyway, one could brute-force hashed data to reverse engineer the unhashed username?
Correct. Our hash function is public and fast, so potentially the data is vulnerable to dictionary attacks.
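To illustrate the concern raised in this exchange, here is a minimal sketch of why a public, fast, unsalted hash of a username can be reversed with a dictionary attack, while a random UUID cannot. The `hash_username` function and its use of SHA-512 are assumptions for illustration only, not the actual kedro-telemetry implementation.

```python
import hashlib
import uuid
from typing import Iterable, Optional


def hash_username(username: str) -> str:
    """Hypothetical stand-in for a public, fast, unsalted hash function."""
    return hashlib.sha512(username.encode("utf-8")).hexdigest()


def dictionary_attack(target: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the candidate whose hash equals `target`, or None if none match."""
    for candidate in candidates:
        if hash_username(candidate) == target:
            return candidate
    return None


# An attacker who sees a hashed username can simply try likely usernames.
wordlist = ["bob", "carol", "alice"]
leaked_hash = hash_username("alice")
recovered = dictionary_attack(leaked_hash, wordlist)

# A random UUID is not derived from the username, so hashing guesses reveals nothing.
random_id = uuid.uuid4().hex
not_recovered = dictionary_attack(random_id, wordlist)
```

Because the identifier is generated at random rather than computed from the username, there is no input for an attacker to guess.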
Great summary and proposal, @astrojuanlu! I fully agree that telemetry should be integrated into Kedro, rather than being a standalone library. Additionally, I want to point out that telemetry is currently not working with the
What's the purpose of each of these environment variables? Can't we just add 1?
On the other hand, there might be people that care specifically about Kedro telemetry, or might choose the explicit setting for whatever reason, hence having Inspiration:
Either Either
Is there anything left to do for this ticket? @astrojuanlu @DimedS
Why do we even want telemetry?
The Kedro team uses telemetry to understand product usage and make data-informed decisions that benefit our users. For example, we were able to determine that certain CLI subcommands had very little usage kedro-org/kedro#1293, kedro-org/kedro#3750. The alternatives would have been
Therefore having telemetry is a low-cost way for us to keep improving Kedro for everyone.
What is wrong with the current telemetry collection process?
At the moment the telemetry collection has two layers of opt-in:
1. `kedro-telemetry` needs to be installed. At the moment this happens because we're introducing it in the requirements of our starters: https://github.com/kedro-org/kedro/blob/bf536d4029d94bb318848b150be78d23fca44fb4/kedro/templates/project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/requirements.txt#L1-L9
   However, this significantly skews the data we have. We have anecdotal evidence that many people don't even know what `kedro new` is, and teams often have their own templates, starters, and ways of working. In addition, this blocks progress on relocating the optional dependencies of our starters kedro-org/kedro#2519.
2. When `kedro-telemetry` is installed and certain (not all) Kedro commands are run for the first time, a blocking prompt is presented to the user asking them whether they opt in for telemetry or not. This prompt causes lots of problems in different environments (see "Improve running kedro as part of an automated workflow (CI/CD)", kedro#1640). Current workarounds, like creating a `.telemetry` file ahead of time, are finicky because sometimes it's not obvious what the working directory of the commands is. In these cases we have just told users to `pip uninstall kedro-telemetry` and go on with their lives, hence losing that information from our side.

Effectively, the presence of this telemetry collection mechanism is both giving us biased data and actively preventing our users from doing their work.
Something has to change.
What do we want to change?
After exploring adjacent libraries and projects as part of #510 (comment), we observed that all of them have an opt-out telemetry collection mechanism.
Therefore, we want to converge with the rest of the ecosystem and make Kedro telemetry opt-out as well.
[Waves arms angrily]
We get it. Some developers and users don't like the idea of opt-out telemetry. Defaults matter.
And yet, if we fail to collect such telemetry, we fail to fulfill our goal of continuing to improve Kedro in a cost-effective way, hence all Kedro users are negatively impacted as a result.
As such, we have taken measures in the past few months to reduce the amount of data we collect:
This is reflected in our telemetry collection policy https://docs.kedro.org/en/0.19.6/configuration/telemetry.html and we still fully stand by it:
Therefore we're committed to storing the minimal amount of information possible, have none of that be personal information (not even IP addresses), make Kedro work exactly the same without it, and offer even more ways to opt out.
We also considered that we had to write all this for full transparency with our users.
What's next?
We are looking into ways to make telemetry collection opt-out, which means: it will be enabled by default for all Kedro Framework projects.
This means that, ideally, anyone who does `pip install kedro` and performs a `kedro run` ought to see a message like this:

Notice the addition of the `KEDRO_DISABLE_TELEMETRY` and `DO_NOT_TRACK` environment variables.
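As a rough sketch of how an opt-out check based on the `KEDRO_DISABLE_TELEMETRY` and `DO_NOT_TRACK` environment variables could look: the function name `telemetry_disabled` and the rule for which values count as "set" are assumptions made for illustration, since the issue only names the variables and does not specify how they are parsed.

```python
import os
from typing import Mapping, Optional

# Variable names come from the issue text; everything else here is an assumption.
OPT_OUT_VARS = ("KEDRO_DISABLE_TELEMETRY", "DO_NOT_TRACK")

# Assumed convention: these values mean "not opted out"; any other value opts out.
FALSY_VALUES = {"", "0", "false", "no", "off"}


def telemetry_disabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if either opt-out variable is set to a non-falsy value."""
    env = os.environ if env is None else env
    for name in OPT_OUT_VARS:
        value = env.get(name)
        if value is not None and value.strip().lower() not in FALSY_VALUES:
            return True
    return False


if __name__ == "__main__":
    # Telemetry stays on by default (opt-out) unless a variable disables it.
    if telemetry_disabled():
        print("Telemetry disabled via environment variable.")
    else:
        print("Telemetry enabled.")
```

Supporting both a Kedro-specific variable and the generic `DO_NOT_TRACK` lets users who opt out of telemetry across all their tools avoid configuring Kedro separately, while still allowing an explicit Kedro-only setting.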