
Auth Tech Spec

Alexander Song edited this page Aug 1, 2024 · 8 revisions

Authentication Spec

🚧 This spec is a work in progress.

โ— This spec describes the end goal of a system with both local authorization and third-party OAuth2 integration. Not all of this will be delivered in the first milestone. Milestones are defined here.

Context

Phoenix has evolved from a notebook tool to an application backed by a database and deployed as an OCI container. Since building persistence into Phoenix, the most common ask from end users has been the ability to deploy Phoenix with authorization. While it's currently possible to secure an instance of Phoenix by deploying it behind a reverse-proxy and implementing custom authentication, this requires significant effort and expertise on the part of the user. Not only are spans, traces, and datasets potentially private, but certain planned features such as prompt playground require the storage of API keys. Building auth will allow users to easily and securely store sensitive data in deployed instances of Phoenix and will unlock development on a new set of features.

Our goal is to enable users to deploy authenticated instances of Phoenix in a straightforward and secure way.

Design Goals

Easy to Deploy, Easy to Use

Phoenix should be easy to use and should provide a great first-touch experience, including during setup and deployment. Deploying Phoenix is about as simple as possible, involving a single OCI container and an optional instance of Postgres. It should require minimal additional effort on the part of the user to deploy Phoenix with authorization.

Secure and Performant

We must also provide a secure solution that balances concerns for performance of the system as a whole.

Secure With No Additional Cost

Users should have the option but should not be required to use third-party authorization services with Phoenix. These services will be cost-prohibitive for some users.

Easy to Recover from Attacks

It should be as simple as possible for users to re-establish security if their credentials are stolen or their database is exposed.

User Stories

Persona Use Case
Amy is a first-time user of Phoenix who is dogfooding notebooks in Colab. She does not have a specific use-case in mind, but expects the tools she uses to be straightforward and simple. Amy is still being introduced to Phoenix as a product and has not yet learned that Phoenix is both a notebook tool and an application that can be self-hosted. She does not need any authentication while running in the notebook.
Alex is a student and individual user of Phoenix working on a personal project. He has heard about Arize-Phoenix and its capabilities for experimentation and iteration. His project is in the early stages of development and runs entirely locally. Alex is running Phoenix locally via the command line with python -m phoenix.server.main serve. He wants traces to be persisted to disk so that he can curate a dataset and iterate on his project over the course of several independent sessions. He neither wants nor expects Phoenix to come with any authentication. He appreciates how straightforward it is to use as a tool for local development.
Betty works on a small team at a startup that has just deployed their first LLM application to production. They are early in their observability journey, but realize that they don't have a way to incrementally improve the performance of their application, and so are investigating tooling for instrumentation, datasets, and experiments. They care mainly about ease of use and a good onboarding experience. Betty requires authentication since her traces contain private customer data, but her team doesn't want to pay for a third-party OAuth2 provider such as Auth0. Her team uses Grafana, and she wants a similar first-touch experience that requires no additional integration or setup. Her team likes Phoenix from a product perspective, but they think it's a hassle to set up a reverse proxy with their own auth. She doesn't even want to bother with an SMTP relay for resetting passwords and is willing to manually edit a database if someone on her team forgets their password. She is used to working with API keys when dealing with LLM inference services and expects to be able to configure her instrumentation code similarly.
Brian was a participant at a hackathon sponsored by Arize and LlamaIndex. He heard the pitch for LlamaTrace, gave it a try, and is now a user and evangelist for both Phoenix and LlamaIndex in his company. He is using hosted Phoenix for a project at work. Brian wants to demonstrate the value of using LlamaIndex with Phoenix to his team by building an assistant for his company's proprietary data. He has followed the LlamaTrace documentation for configuring his instrumentation, and expects everything to just work. He does not even realize that the authentication in his case is handled by a LlamaTrace service that is separate from Phoenix itself.
Cathy is a long-time user of Phoenix whose team has deployed a self-hosted Phoenix behind a reverse proxy with their own authentication. She is excited to hear that Phoenix now natively supports authentication. Her company already uses Auth0 throughout the organization. Cathy doesn't want just anyone with access to the domain where Phoenix is hosted to be able to create an account and sign in. Rather, she wants to use Auth0 to administrate who has access to Phoenix, so she has control over who can see traces and so that credentials for departing team members can easily be revoked.
Carl works at a consulting firm helping enterprise companies adopt generative AI. His team is currently working with a client who is developing their first assistant. Carl wants to deploy an instance of Phoenix to educate his client on observability and evaluation. He administrates his team of engineers in a third-party OAuth2 provider (e.g., Google Identity), but he also wants his clients to be able to create accounts and sign in with no additional help when provided with the link to the host instance of Phoenix.
Diana is deeply familiar with Phoenix. She subscribes to release notes and has incorporated Phoenix deeply within her startup's infrastructure, going so far as to leverage undocumented REST and GraphQL APIs to build out bespoke workflows for datasets, experiments, and human annotations. The thing she values about Phoenix is its hackability. Diana wants to be able to interact with Phoenix GraphQL and REST APIs in a secure manner. She not only wants API keys to be scoped to particular users, but she wants those keys to expire after a set amount of time. For things like annotations in particular, she wants to be able to attribute the source of the annotation to a particular user on her team. She uses GraphiQL and the Swagger UI for testing out requests as she is building.

User Flows

To make sure Phoenix is easy to deploy, users should not be required to configure third-party authorization services that introduce additional steps to the setup and deployment process. For this reason, Phoenix will offer a local authentication flow. Similarly, we should not require users to configure an SMTP server to perform basic operations such as resetting forgotten passwords.

Local Auth Deployment and First Touch

To deploy Phoenix with local authentication, the user configures two additional environment variables:

  • PHOENIX_ENABLE_LOCAL_AUTH: A boolean flag signaling whether to use local auth, defaulting to false. If true, PHOENIX_SECRET must also be set.
  • PHOENIX_SECRET: A secret used to sign and validate tokens and API keys. We will provide a strong recommendation in our documentation to randomly generate the secret in a secure way (e.g., openssl rand -base64 32) and will require secrets to be of a certain length to discourage weak secrets.
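
As a rough illustration of the startup check described above, the sketch below reads both variables and rejects weak configurations. The 32-character minimum and the function names are assumptions made for this example; the spec only says secrets must be "of a certain length".

```python
import os
import secrets

MIN_SECRET_LENGTH = 32  # assumed value; the spec does not fix the minimum


def load_auth_settings(env=None):
    """Return (enabled, secret), raising ValueError on a bad configuration."""
    env = os.environ if env is None else env
    flag = env.get("PHOENIX_ENABLE_LOCAL_AUTH", "false")
    enabled = flag.strip().lower() == "true"
    if not enabled:
        return False, None
    secret = env.get("PHOENIX_SECRET", "")
    if not secret:
        raise ValueError("PHOENIX_SECRET must be set when local auth is enabled")
    if len(secret) < MIN_SECRET_LENGTH:
        raise ValueError(
            f"PHOENIX_SECRET must be at least {MIN_SECRET_LENGTH} characters"
        )
    return True, secret


def generate_secret():
    """Python counterpart to `openssl rand -base64 32`: 32 random bytes, URL-safe."""
    return secrets.token_urlsafe(32)
```

Failing fast at startup keeps a misconfigured instance from silently running without auth.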

When Phoenix spins up for the first time, the user will be shown a sign-up form with instructions to enter their username, email address, and password, and will be told that they will be the first admin user. Passwords shorter than some minimum length will be rejected. After entering their credentials, they will be redirected to the home page.

User Administration

Admin users will have access to a user administration page accessible via the left sidebar that will show all users in a table. This page will allow admins to:

  • create new users
  • view profiles for existing users, including username, email address, and role (admin vs. non-admin)
  • modify profiles for existing users, including username, email address, role, and password
  • delete existing users

Admins will be able to change their own role and the roles of other users, converting non-admins to admins and vice-versa. We will guarantee that there is always at least one admin by preventing actions that would remove the last admin.
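
The "always at least one admin" guarantee could be enforced with a check along these lines. This is a minimal in-memory sketch: the function name and the dictionary of roles are stand-ins for whatever the real data layer provides, and in practice the check would need to run in the same database transaction as the change.

```python
def ensure_admin_remains(users, target_id, new_role=None):
    """Return the user map after the change, or raise if no admin would remain.

    `users` maps user ID -> role ("admin" or "general").
    `new_role=None` means the target user is being deleted.
    """
    remaining = dict(users)  # work on a copy so a rejected change has no effect
    if new_role is None:
        remaining.pop(target_id, None)
    else:
        remaining[target_id] = new_role
    if "admin" not in remaining.values():
        raise ValueError("at least one admin must remain")
    return remaining
```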

User Profiles

Admins and non-admins alike will be able to view and change their own profile information, including their username, email address, and password, via a profile page. When the user signs in via an OAuth2 integration.

Configuring SMTP Relay Services

Users will configure SMTP relay services via the following environment variables:

  • PHOENIX_ENABLE_SMTP: boolean flag, whether to enable SMTP
  • PHOENIX_SMTP_HOST: host of SMTP server
  • PHOENIX_SMTP_USER: email address used to authenticate
  • PHOENIX_SMTP_PASSWORD: password used to authenticate
  • PHOENIX_SMTP_FROM_ADDRESS: address used when sending out emails (defaults to PHOENIX_SMTP_USER)
  • PHOENIX_SMTP_FROM_NAME: name used when sending emails (defaults to "Arize-Phoenix")

For example,

PHOENIX_ENABLE_SMTP=true
PHOENIX_SMTP_HOST=smtp.example.com
PHOENIX_SMTP_USER=[email protected]
PHOENIX_SMTP_PASSWORD=supersecretpassword
PHOENIX_SMTP_FROM_ADDRESS=[email protected]
PHOENIX_SMTP_FROM_NAME=Phoenix App

SMTP relays will just be for resetting passwords at first, but we may find additional use for them down the line.
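
To make the configuration concrete, here is one way these variables might be turned into a password reset email using only the Python standard library. The function names, the subject line, the message wording, and the default port of 587 are assumptions for this sketch; the spec does not define a port variable.

```python
import smtplib
from email.message import EmailMessage
from email.utils import formataddr


def build_reset_email(env, to_address, reset_url, minutes_valid=15):
    """Compose the reset email from the PHOENIX_SMTP_* settings in `env`."""
    # FROM_ADDRESS falls back to SMTP_USER, FROM_NAME to "Arize-Phoenix".
    from_address = env.get("PHOENIX_SMTP_FROM_ADDRESS") or env["PHOENIX_SMTP_USER"]
    from_name = env.get("PHOENIX_SMTP_FROM_NAME", "Arize-Phoenix")
    msg = EmailMessage()
    msg["From"] = formataddr((from_name, from_address))
    msg["To"] = to_address
    msg["Subject"] = "Reset your Phoenix password"  # hypothetical wording
    msg.set_content(
        "Use the link below to reset your password.\n"
        f"{reset_url}\n"
        f"The link expires in {minutes_valid} minutes and can be used once.\n"
    )
    return msg


def send_email(msg, env, port=587):
    """Send via the configured relay; the port is an assumed default."""
    with smtplib.SMTP(env["PHOENIX_SMTP_HOST"], port) as server:
        server.starttls()
        server.login(env["PHOENIX_SMTP_USER"], env["PHOENIX_SMTP_PASSWORD"])
        server.send_message(msg)
```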

Password Recovery

Via SMTP Relay

When an SMTP relay is configured, users will be able to click on a reset password link on the login page, which will prompt the user to enter their username or email address. If the input is a valid username or email address, a password recovery email will be sent to the user containing:

  • a button that links to the reset page
  • a plain-text link with explicit instructions to copy and paste it into a browser
  • a mention of how long the user has to reset their password

The link will have a short lifespan (e.g., 15 minutes) and will be usable only once.

With Admin Help

When no SMTP relay is configured, users will be able to recover their passwords with the help of an admin by:

  • requesting that an admin change their password to a temporary password
  • logging in with the temporary password
  • changing their password in the UI

Alternatively, instead of allowing admins to change user passwords directly, we can add a button by each user in the user administration table to allow admins to generate one-time password recovery links to send to users.

Manual Password Reset

As a last resort, admins should be able to reset their own forgotten passwords by manually editing the database. This is mainly a documentation task. Utility functions for generating salts and computing password hashes will be documented. We will provide explicit instructions for an admin to:

  • find their row in the users table and copy their salt
  • compute the hashed password with their salt and the new password of their choice
  • update their hashed password in the users table
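
The documented utility for the second step might look roughly like the sketch below. It assumes PBKDF2 with SHA256 (one of the options recommended later in this spec), a salt stored as hex text, and an illustrative iteration count; none of these details are settled by the spec.

```python
import hashlib


def compute_password_hash(password, salt_hex, iterations=600_000):
    """Return the hex-encoded PBKDF2-HMAC-SHA256 hash of `password`.

    `salt_hex` is the salt copied from the users table (assumed to be hex).
    """
    salt = bytes.fromhex(salt_hex)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return digest.hex()
```

An admin would run this locally with the copied salt and a new password, then paste the result into their row in the users table.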

Recovering from Security Breaches

We will provide a guide to follow when admins suspect their database has been compromised. In this case, an admin should first reset the secret to a new, randomly generated value. Doing this will invalidate all previously issued API keys and all passwords. We will provide instructions and code snippets for how to compute password hashes given a salt and password. The admin can then compute and update their own password hash in the users table to regain access to the UI and reset user passwords via the user administration page.

Managing API Keys

Phoenix's REST and GraphQL APIs must be accessible not only via a user session, but also programmatically. To facilitate this, Phoenix will issue two kinds of API keys, user and system API keys. As the name suggests, user API keys are associated with and act on behalf of the user to which they are issued. That user has the ability to view and delete their own user keys, and if the user is deleted, so are all of their associated user keys. A user might create their own user key in order to run an experiment in a notebook, for example.

System keys, in contrast, act on behalf of the system as a whole rather than any particular user. They can only be created by admins, are not meaningfully associated with the admin who creates them except for auditing purposes, and do not disappear if that admin is deleted. A system key would be the recommended kind of key to use in programmatic interactions with Phoenix that do not involve a user (e.g., automated flows querying our REST APIs).

General users will be able to manage their user API keys via an API key page, where they can:

  • create new API keys with:
    • name
    • optional description
    • optional lifespan
  • view existing API keys:
    • name
    • description
    • last four characters of the key (to help disambiguate between multiple keys)
    • expiration date
    • whether the key is still valid (keys can be expired or can be invalid because the PHOENIX_SECRET was changed)
  • delete API keys

Admins will be able to view a separate page showing all system keys.

Requirements:

  • Users will be able to copy API keys upon creation, but the full key should not be visible after that.
  • Users should be able to delete individual keys without affecting other keys.
  • Invalidating the PHOENIX_SECRET should invalidate all previously issued API keys. This provides a simple mechanism to secure the API in the event of an attack.

Using API Keys

Users will need to set their API key anytime they interact programmatically with Phoenix APIs.

Client

Users will be able to set their API key using a PHOENIX_API_KEY environment variable when using phoenix.Client and our experiments API. We may wish to add an api_key parameter to phoenix.Client, but there is not currently a user-facing client for the experiments API, so we may wish to rely purely on environment variables for the sake of consistency.

Requests

Raw requests to Phoenix REST or GraphQL APIs will need to attach the API key as a header (e.g., X-API-Key).
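
A raw request carrying the key might be assembled as in the sketch below. The helper names are hypothetical, and X-API-Key is only the example header name given above, not a final decision.

```python
import os
import urllib.request


def api_key_headers(env=None):
    """Build the auth header from PHOENIX_API_KEY (header name is the spec's example)."""
    env = os.environ if env is None else env
    key = env.get("PHOENIX_API_KEY")
    if not key:
        raise RuntimeError("PHOENIX_API_KEY is not set")
    return {"X-API-Key": key}


def build_request(url, env=None):
    """Return an unsent request with the API key attached."""
    return urllib.request.Request(url, headers=api_key_headers(env))
```

Passing the result to `urllib.request.urlopen` would send it; the same headers dictionary works with other HTTP clients.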

The user stories and user flows above show how the REST and GraphQL APIs both require session-based and API key-based auth:

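|         | Session-Based        | Key-Based                                    |
| ------- | -------------------- | -------------------------------------------- |
| GraphQL | using the UI         | programmatic access to GraphQL API           |
| REST    | using the Swagger UI | using phoenix.Client or sending raw requests |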

Access to these routes and resolvers will be controlled by fastapi dependencies and middleware. If we introduce scoped access at a later time (i.e., where users can access only a subset of our APIs), we can use strawberry permissions to provide more granular access to specific GraphQL resolvers.

Some non-sensitive routes will be left unsecured (e.g., /healthz, /arize-phoenix-version, static assets, etc.). We will add public POST routes for login and logout, e.g., /login and /logout. Since users need to be authenticated before they access our single-page React app, we'll need to serve a login form (i.e., templated HTML), e.g., on GET /login.
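
The access rules above can be summarized in a framework-agnostic sketch; in Phoenix the equivalent logic would live in a FastAPI dependency or middleware. The cookie name, the /assets/ prefix for static files, and the two lookup sets are assumptions made for illustration.

```python
PUBLIC_PATHS = {"/healthz", "/arize-phoenix-version", "/login", "/logout"}
STATIC_PREFIX = "/assets/"  # assumed location of static assets


def authorize(path, headers, cookies, valid_sessions, valid_api_keys):
    """Return how the request is authorized, or raise PermissionError.

    `valid_sessions` and `valid_api_keys` stand in for real session and
    key validation.
    """
    if path in PUBLIC_PATHS or path.startswith(STATIC_PREFIX):
        return "public"
    session_id = cookies.get("session")  # hypothetical cookie name
    if session_id is not None and session_id in valid_sessions:
        return "session"
    api_key = headers.get("X-API-Key")
    if api_key is not None and api_key in valid_api_keys:
        return "api_key"
    raise PermissionError("authentication required")
```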

Technical Considerations

Session-Based Auth

Local Auth Bounce Diagram

🚧 This section is a work in progress.

Password Salting and Verification

🧂 Recommendation: Use a single salt, PHOENIX_SECRET

We will use so-called password "salting" to securely verify user credentials. The purpose of salting is to avoid storing passwords in plaintext and to mitigate against attacks using pre-computed hash tables. Salting passwords involves appending a random string of characters (called a "salt") to each password before hashing it. During verification, the system appends the salt to the input password, hashes the combination, and compares the result to the stored hash to see if they match. There are different strategies for choosing a salt. Some applications use a single salt via an environment variable (in our case, we would simply use PHOENIX_SECRET). The main advantage of this approach is it provides a simple lever to pull to invalidate all passwords in the case of an attack (the same lever to pull to invalidate all API keys).

Another strategy that is generally considered more secure is for the application to randomly generate salts on a per-user basis. This has a few advantages:

  • it increases the cost of brute-force attacks since hashes need to be computed not just for one salt, but for many salts
  • it ensures that every stored hash is unique even if multiple users have the same password, so a bad actor with database access cannot deduce that two users have the same password

In isolation, unique salts per user would be the way to go since they are no more complex to implement and are more resilient to brute-force attacks. However, this consideration is more important in the usual context where the authentication and resource databases are separate. In our case, they are one and the same, so if someone has compromised the users table containing hashed passwords and is running a brute-force attack over it, they've probably already compromised the tables containing sensitive data.

It would be possible to use a combination of both salting strategies, using both a system-wide salt and user-specific salts to gain the benefits of both. The benefit of doing this probably does not warrant the additional complexity.
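
The trade-off between the two strategies can be seen in a small experiment. This is only a demonstration: the iteration count is deliberately low to keep it fast, and the byte strings stand in for real secrets.

```python
import hashlib
import os


def hash_password(password, salt, iterations=1_000):
    # Low iteration count for a fast demo; real deployments need far more.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)


# Strategy 1: a single system-wide salt (a stand-in for PHOENIX_SECRET).
shared_salt = b"stand-in for PHOENIX_SECRET"
alice_shared = hash_password("hunter2", shared_salt)
bob_shared = hash_password("hunter2", shared_salt)

# Strategy 2: a random salt per user.
alice_unique = hash_password("hunter2", os.urandom(16))
bob_unique = hash_password("hunter2", os.urandom(16))

# Identical passwords give identical hashes under a shared salt...
same_under_shared = alice_shared == bob_shared
# ...but not under per-user salts.
same_under_unique = alice_unique == bob_unique
# Rotating the shared salt invalidates every stored hash at once.
invalidated_by_rotation = hash_password("hunter2", b"a new secret") != alice_shared
```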

🧮 Recommendation: Compute password hashes with PBKDF2 with SHA256 or scrypt (i.e., do as Django does)

Password hashes will be computed using a cryptographic hash function. The ideal hash function is slow enough to be resilient against brute-force attacks, but not so slow that it makes logging in noticeably laggy. An algorithm such as SHA256 is generally considered too fast and susceptible to brute-force attacks by itself. The default algorithm used by Django is PBKDF2 with SHA256, which runs SHA256 several hundred thousand times and is available in the Python standard library. A more recent alternative is scrypt, a cryptographic hash function that is tunable in terms of its computation time and memory usage and is also available in the Python standard library.

The following algorithms are supported by Django:

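| Cryptographic Hash Function | Adjustable Time | Adjustable Memory | In Python Standard Library |
| --------------------------- | --------------- | ----------------- | -------------------------- |
| PBKDF2 with SHA256          | ✅              | ❌                | ✅                         |
| bcrypt (used by Auth0)      | ✅              | ❌                | ❌                         |
| scrypt                      | ✅              | ✅                | ✅                         |
| argon2                      | ✅              | ✅                | ❌                         |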
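
For comparison with PBKDF2, a scrypt-based hash and verification step might look like the sketch below, using only the standard library. The cost parameters and output length are illustrative assumptions, not tuned recommendations.

```python
import hashlib
import hmac
import os

# Assumed cost settings: n tunes time and memory, r the block size, p parallelism.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1}
MAX_MEMORY = 64 * 1024 * 1024  # upper bound on memory scrypt may use, in bytes


def hash_password(password, salt):
    """Derive a 64-byte hash from the password and salt with scrypt."""
    return hashlib.scrypt(
        password.encode("utf-8"),
        salt=salt,
        maxmem=MAX_MEMORY,
        dklen=64,
        **SCRYPT_PARAMS,
    )


def verify_password(password, salt, expected_hash):
    """Recompute the hash and compare in constant time."""
    return hmac.compare_digest(hash_password(password, salt), expected_hash)
```

Comparing with `hmac.compare_digest` rather than `==` avoids leaking timing information during verification.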


API Key-Based Authentication

🔑 Recommendation: Use JWTs for API keys rather than opaque strings

API keys can be opaque strings (e.g., randomly generated hashes) or can have self-encoded content (e.g., JWTs).

JWTs (JSON Web Tokens), and in particular JWSs (JSON Web Signatures), are tokens that contain:

  • a base64url-encoded JSON header containing metadata
  • a base64url-encoded JSON payload containing data
  • a signature, a keyed hash (e.g., HMAC-SHA256) computed over the encoded header and payload using a secret

A server (e.g., Phoenix) can issue a JWT signed with a secret (e.g., PHOENIX_SECRET) to a client. When the server receives a JWT from a client, it can compute a hash to determine whether the JWT was signed using its own secret and whether the payload and header have been changed. This means that the server can know whether to trust the data in a token simply by inspecting the token itself without maintaining state about the tokens it has issued.

The JWT payload for a Phoenix API key might look like:

{
    "sub": 19,
    "iat": 1516239022,
    "exp": 1516242622
}

where "sub", "iat", and "exp" are so-called "registered claims" standing for "subject", "issued at", and "expires at", respectively. In this case, "sub" represents the ID of the token in the database.
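
To show the mechanics, the sketch below signs and verifies an HS256 token with the standard library alone. It is an illustration under assumptions, not the planned implementation: a maintained JWT library would likely be used in practice, and the function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input, secret):
    return hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()


def encode_jwt(payload, secret):
    """Return header.payload.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    ]
    signing_input = ".".join(parts)
    return f"{signing_input}.{_b64url(_sign(signing_input, secret))}"


def decode_jwt(token, secret, now=None):
    """Return the payload if the signature matches and the token is unexpired."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        signature = _b64url_decode(signature_b64)
    except ValueError as exc:
        raise ValueError("malformed token") from exc
    # Verify the signature before trusting any of the payload.
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(signature, expected):
        raise ValueError("invalid signature")
    payload = json.loads(_b64url_decode(payload_b64))
    now = time.time() if now is None else now
    if "exp" in payload and now >= payload["exp"]:
        raise ValueError("token expired")
    return payload
```

Because verification depends only on the secret, changing PHOENIX_SECRET causes every previously issued token to fail the signature check.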

The main advantage of JWTs over opaque strings is that they allow several kinds of information to be checked without querying a database.

  • Unlike opaque strings, which would require storing salted hashes in a database, JWTs can be validated without consulting a database.
  • Expired tokens can be rejected simply by inspecting the token itself rather than retrieving an expiry from the database.
  • In the future, we may add "scopes" to our API keys, so that certain scopes are required to access certain APIs (e.g., "read" vs. "write" scopes). If we use JWTs with a "scope" claim in the payload, server middleware can check whether an API key can use an API by simply inspecting the JWT.


Even if we use JWTs for our API keys, we will still need to maintain state on all issued tokens so that users can view details of their previously issued keys (name, description, expiration, validity).

When validating an API key, checking that the API key has not expired does not guarantee that the key is still valid, since it is possible that a user or admin revoked the key in the UI. See the section on caching below.
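
The two-step check might be sketched as follows: the expiry test uses only the token's claims, while the revocation test needs server-side state. The function name and the set of stored key IDs are stand-ins for a lookup against the api_keys table (or a cache of it).

```python
def is_key_usable(claims, stored_key_ids, now):
    """Decide whether a signature-verified API key may be used.

    `claims` is the verified JWT payload; `stored_key_ids` holds the IDs of
    keys that still exist (deleted or revoked keys are absent).
    """
    expires_at = claims.get("exp")  # lifespan is optional, so "exp" may be missing
    if expires_at is not None and now >= expires_at:
        return False  # stateless check: the token says it has expired
    return claims["sub"] in stored_key_ids  # stateful check: not revoked
```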

Password Recovery URL

The password recovery URL sent via a configured SMTP relay will have as a query parameter a JWT containing the user ID and a short lifespan (e.g., 15 minutes). Its payload might look like:

{
    "sub": 19,
    "iat": 1516239022,
    "exp": 1516242622
}

In this case, "sub" refers to the database ID of the user. Once again, JWTs are preferred to randomly generated hashes because they enable expired and invalid tokens to be rejected without querying the database.
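
Expiry can be checked from the token alone, but the single-use requirement still needs server-side state. The sketch below builds the recovery URL and consumes a verified token exactly once. The /reset-password path, the function names, and the use of the ("sub", "iat") pair as a token identifier are assumptions; a dedicated "jti" claim would be another option.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_reset_url(base_url, token):
    """Attach the signed token as a query parameter (the path is hypothetical)."""
    query = urlencode({"token": token})
    return f"{base_url.rstrip('/')}/reset-password?{query}"


def consume_reset_token(claims, used_tokens, now):
    """Accept a signature-verified reset token once; return the user ID.

    `used_tokens` stands in for the password reset token table.
    """
    if now >= claims["exp"]:
        raise ValueError("reset link expired")
    token_id = (claims["sub"], claims["iat"])  # assumed identifier for the token
    if token_id in used_tokens:
        raise ValueError("reset link already used")
    used_tokens.add(token_id)
    return claims["sub"]
```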

Caching Strategy

🚧 This section is in progress.

OAuth2 Integration

🚧 This section is in progress.

We will add several new database tables.

  • A user_roles table that will initially define three roles:
    • system: the role taken by the system user described below
    • admin
    • general: non-admin users
  • A users table with a foreign key relation to the user_roles table
    • The users table will come pre-populated with a system user, the only user in the table granted the system role, who is the user associated with system API keys.
    • There may also be an initial admin user with a default password (if we wish to support this first-touch flow).
  • An api_keys table containing both kinds of API keys (system and user). This table includes information such as name, description, and expiry that will be displayed in the UI.
  • Separate tables for access, refresh, and password reset tokens. The motivation for using separate tables here is to be able to apply different retention policies on each. The purpose of these tables is to track whether individual tokens have been used, since they must be single-use only.
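
A rough picture of the first three tables, sketched against an in-memory SQLite database. All column names and types are assumptions for illustration; the token tables are omitted, and the real schema would be defined in a migration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE user_roles (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    user_role_id INTEGER NOT NULL REFERENCES user_roles (id),
    username TEXT NOT NULL UNIQUE,
    email TEXT UNIQUE,
    password_hash BLOB
);
CREATE TABLE api_keys (
    id INTEGER PRIMARY KEY,
    -- Deleting a user removes that user's keys; system keys belong to the
    -- system user and therefore outlive the admin who created them.
    user_id INTEGER NOT NULL REFERENCES users (id) ON DELETE CASCADE,
    name TEXT NOT NULL,
    description TEXT,
    expires_at TIMESTAMP
);
INSERT INTO user_roles (name) VALUES ('system'), ('admin'), ('general');
INSERT INTO users (user_role_id, username)
    SELECT id, 'system' FROM user_roles WHERE name = 'system';
"""


def create_auth_tables(conn):
    """Create the tables and pre-populate the roles and the system user."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this for cascades
    conn.executescript(SCHEMA)
```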

To avoid a long-lived feature branch and to give ourselves the flexibility to dogfood the table structure, we can try having a PHOENIX_DANGEROUSLY_ENABLE_EXPERIMENTAL_AUTH setting (a precursor to PHOENIX_ENABLE_AUTH) that would run the migration containing the new auth-related tables (we would configure this migration to not run otherwise).