Command line interface for interacting with Label Studio and the (deprecated) tog data server.

skit-ai/skit-labels

skit-labels

Command line interface for interacting with labelled datasets at skit.ai.

Installation

pip install skit-labels

Tog Datasets

Tog is our data annotation tool. Its data server stores our tagged and untagged data. Tagging work is organized into jobs, each of which holds a set of tasks to be tagged.

We can download and also upload data using this package.

DVC Datasets

We maintain an internal repository of datasets for common experiments. These are linked to private S3 storage and are accessible with the necessary auth tokens. This tool can download them as described in the usage guide.

For now, DVC datasets can only be downloaded, not uploaded.

Configuration

Almost all commands need backend credentials to be set in a few environment variables. An example follows. Contact a team member to get the credentials for our server.

export TOGDB_HOST=localhost
export TOGDB_PORT=9999
export TOGDB_USER=username
export TOGDB_PASS=password
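
Before running a command, it can help to confirm that these variables are actually set. The following is a minimal Python sketch, assuming the four variables shown above are the full set the backend needs; the helper name missing_togdb_vars is hypothetical and not part of skit-labels.

```python
import os

# Variable names taken from the example above. Treating these four as the
# complete required set is an assumption.
REQUIRED_VARS = ("TOGDB_HOST", "TOGDB_PORT", "TOGDB_USER", "TOGDB_PASS")


def missing_togdb_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_togdb_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All TOGDB_* variables are set.")
```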

Usage

This package comes with a few commands. We snapshot the help message for each of them here.

The main command is skit-labels. This was the name of our annotation program. We may give it a more general name later to cover all dataset requirements.

> skit-labels -h
usage: skit-labels [-h] [-v] {download,upload,describe,stats} ...

skit-labels 0.3.1. Command line interface for interacting with labelled datasets.

positional arguments:
  {download,upload,describe,stats}
    download            Download a dataset. of a given id from the database.
    upload              Upload a dataset.
    describe            Describe a dataset for a given tog id.
    stats               Get tagged/untagged points for a given tog id.

optional arguments:
  -h, --help            show this help message and exit
  -v                    Increase verbosity.

The download subcommand downloads datasets from the database or from our dvc repository. Both require authentication since these datasets are private.

> skit-labels download -h
usage: skit-labels download [-h] {db,dvc} ...

positional arguments:
  {db,dvc}
    db        Download a dataset of a given id from the database.
    dvc       Download a dataset from a dvc enabled repo.

optional arguments:
  -h, --help  show this help message and exit


Each source, tog and dvc, has its own download subcommand with its own options.

> skit-labels download tog -h
usage: skit-labels download tog [-h] -j JOB_ID [-o {.csv,.sqlite}] [-tz TIMEZONE]
                                [--batch-size BATCH_SIZE] [--full]
                                [-tt {conversation,simulated_call,audio_segment,dict,call_transcription,data_generation}]
                                [--start-date START_DATE] [--end-date END_DATE]

optional arguments:
  -h, --help            show this help message and exit
  -j JOB_ID, --job-id JOB_ID
                        Id of the tog dataset that we want to download. (default: None)
  -o {.csv,.sqlite}, --output-format {.csv,.sqlite}
                        Store dataset in supported formats. (default: .csv)
  -tz TIMEZONE, --timezone TIMEZONE
                        Timezone to parse datetime values. Like 'America/Los_Angeles',
                        'Asia/Kolkata' etc. (default: UTC)
  --batch-size BATCH_SIZE
                        Number of items to download in a batch. (default: 500)
  --full                If provided, download all data instead of including untagged
                        datapoints. (default: False)
  -tt {conversation,simulated_call,audio_segment,dict,call_transcription,data_generation}, 
                        --task-type {conversation,simulated_call,audio_segment,dict,call_transcription,data_generation}
                        Task type for deserialization. (default: conversation)
  --start-date START_DATE
                        Filter items added to the dataset after this date. (inclusive)
                        (default: None)
  --end-date END_DATE   Filter items added to the dataset before this date. (exclusive)
                        (default: None)

> skit-labels download dvc -h
usage: skit-labels download dvc [-h] --repo REPO --path PATH [--remote REMOTE]

optional arguments:
  -h, --help       show this help message and exit
  --repo REPO      DVC enabled git repository. (default: None)
  --path PATH      Path to the dataset. (default: None)
  --remote REMOTE  Remote. Required only if the repo hasn't set a default remote. This is
                   usually a bucket name. (default: None)

We can describe a dataset on tog db using the following command.

> skit-labels describe -h
usage: skit-labels describe [-h] [--job-id JOB_ID]

optional arguments:
  -h, --help       show this help message and exit
  --job-id JOB_ID  Id of the tog dataset that we want to describe.

To see how many data points are tagged, untagged, skipped etc., we use the stats command.

> skit-labels stats -h
usage: skit-labels stats [-h] [--job-id JOB_ID]

optional arguments:
  -h, --help       show this help message and exit
  --job-id JOB_ID  Check the state of the dataset i.e tagged, untagged and pending data
                   points for a given job-id.

> skit-labels upload tog -h
usage: skit-labels upload tog [-h] -j JOB_ID [--url URL] [--token TOKEN] [-i INPUT]

optional arguments:
  -h, --help            show this help message and exit
  -j JOB_ID, --job-id JOB_ID
                        Dataset id where the data should be uploaded. (default: None)
  --url URL             URL of the dataset server. Optionally set the DATASET_SERVER_URL
                        environment variable. (default: None)
  --token TOKEN         The organization authentication token. (default: fake_access_token)
  -i INPUT, --input INPUT
                        The raw data to be uploaded. (default: None)

Example

Download dataset from tog.

> skit-labels download tog --job-id=61 --output-format=.csv --task-type conversation
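
When downloads are scripted, the same command can be assembled from Python. This is a sketch only: it uses the flags listed in the help output above, the function names are hypothetical, and it assumes skit-labels is on PATH and the credentials are configured. Date values are passed through as strings because the expected date format is not documented here.

```python
import subprocess


def build_download_command(job_id, output_format=".csv", task_type="conversation",
                           timezone=None, start_date=None, end_date=None, full=False):
    """Assemble the argument list for `skit-labels download tog`."""
    cmd = [
        "skit-labels", "download", "tog",
        "--job-id", str(job_id),
        "--output-format", output_format,
        "--task-type", task_type,
    ]
    # Optional flags are added only when a value is given, so the CLI
    # defaults shown in the help message apply otherwise.
    if timezone:
        cmd += ["--timezone", timezone]
    if start_date:
        cmd += ["--start-date", start_date]
    if end_date:
        cmd += ["--end-date", end_date]
    if full:
        cmd.append("--full")
    return cmd


def download_job(job_id, **options):
    """Run the download; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_download_command(job_id, **options), check=True)


# Show the command without running it.
print(" ".join(build_download_command(61)))
```

Calling download_job(61) would run the command; keeping the argument list separate from the call makes it possible to log or inspect the command first.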

Upload dataset to tog for annotation.

> skit-labels -vvvvv upload tog -j <int> --token=<token> --url https://apigateway.vernacular.ai

If you have used the skit-auth command line tool, the token will already be saved in ~/.skit/config.json, and the --token argument is then optional. Note that the organization information is embedded within the token, so the upload will fail if an incorrect token is used.
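
To check whether a saved token is available before uploading, the config file can be read directly. The sketch below is illustrative: the layout of config.json is not documented here, so the key name "access_token" is an assumption and load_saved_token is a hypothetical helper. Inspect your own file to confirm where the token is stored.

```python
import json
from pathlib import Path


def load_saved_token(config_path=None, key="access_token"):
    """Return the value stored under `key`, or None if it cannot be read.

    `key` is an assumed name, not a documented part of the config format.
    """
    path = Path(config_path) if config_path else Path.home() / ".skit" / "config.json"
    try:
        config = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    if not isinstance(config, dict):
        return None
    return config.get(key)


if load_saved_token() is None:
    print("No saved token found; pass --token explicitly.")
else:
    print("Saved token found; --token can be omitted.")
```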

Task Types

Task type is an optional argument for downloading datasets from tog. It is needed if you want type validation. If you don’t provide it, we just assume raw dictionary objects. The task types are:

  • conversation [default]
  • simulated_call
  • audio_segment
  • call_transcription
  • data_generation

Conversation

This is the most common task type. This accepts data from skit-calls | skit-fixdf.

Simulated Call

We build an interface to simulate conversation flows without actually deploying ML models. For generating NLU training data for a new client, we have a plan to simulate calls covering various situations and then voicing over them to generate training data. This has two benefits over our older method:

  • We don’t have to go through test calls twice (once for generating data and once for tagging).
  • The simulator can define conditions and distributions for generating data, instead of relying on human callers, who provide very biased and mostly top-level intent data.

Call Transcription

Call transcription is the activity of manually listening to calls and transcribing them. It is essential for training AI models and for designing conversation flows and bot prompts. A user-friendly UI is needed to transcribe as many calls as possible with minimal effort and reasonable accuracy.

Data Generation

The interface allows setting an intent and, optionally, entities. Once these are set, the interface allows recording audio repeatedly for rapid generation of data points. This dataset also lacks the structure that a Conversation task dataset has, because we don’t have a flow or ML model deployed to produce these values.

All these datasets may need some pre-processing before we use them for training.
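
A quick way to see what pre-processing a downloaded .csv file might need is to count rows and empty cells per column. The sketch below makes no assumption about column names and reads whatever header the file has; the helper name summarize_csv is hypothetical, and the file path is whatever the download produced.

```python
import csv
from collections import Counter


def summarize_csv(path):
    """Return (row_count, {column: number_of_empty_cells}) for a CSV file."""
    empty_cells = Counter()
    rows = 0
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        for row in reader:
            rows += 1
            for column in columns:
                # Missing trailing fields come back as None; treat them as empty.
                if not (row.get(column) or "").strip():
                    empty_cells[column] += 1
    return rows, {column: empty_cells[column] for column in columns}
```

Columns with many empty cells are natural candidates to drop or fill before training.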