Command line interface for interacting with labelled datasets at skit.ai.
```shell
pip install skit-labels
```
Tog is our data annotation tool, and its data server is our store of tagged/untagged data. Tagging efforts are organized in terms of jobs, each of which holds a set of tasks to be tagged. This package lets us both download and upload data.
We also maintain an internal repository of datasets for common experiments. These are linked on private s3 and are accessible with the necessary auth tokens. This tool can download them as described in the usage guide. For now, datasets on dvc can only be downloaded, not uploaded.
For almost all commands you will need backend credentials set in a few environment variables. An example follows; contact a team member to get the credentials for our server.

```shell
export TOGDB_HOST=localhost
export TOGDB_PORT=9999
export TOGDB_USER=username
export TOGDB_PASS=password
```
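Before shelling out to the CLI in a script, it can help to fail fast when any of these variables are missing. The following is a minimal sketch; the helper name `check_togdb_env` is illustrative and not part of skit-labels.

```python
import os

# The four credentials listed in the exports above.
REQUIRED_VARS = ("TOGDB_HOST", "TOGDB_PORT", "TOGDB_USER", "TOGDB_PASS")

def check_togdb_env():
    """Return the TOGDB_* settings as a dict, raising if any are unset."""
    missing = [name for name in REQUIRED_VARS if name not in os.environ]
    if missing:
        raise EnvironmentError("Missing credentials: " + ", ".join(missing))
    return {name: os.environ[name] for name in REQUIRED_VARS}

# Example, mirroring the placeholder values above:
os.environ.update(TOGDB_HOST="localhost", TOGDB_PORT="9999",
                  TOGDB_USER="username", TOGDB_PASS="password")
print(check_togdb_env()["TOGDB_HOST"])  # localhost
```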
A few commands ship with this package; we snapshot the help message for each here. The main command is skit-labels, named after our annotation program. We may generalize it later to a name that covers all dataset requirements.
```
> skit-labels -h
usage: skit-labels [-h] [-v] {download,upload,describe,stats} ...

skit-labels 0.3.1. Command line interface for interacting with labelled datasets.

positional arguments:
  {download,upload,describe,stats}
    download            Download a dataset of a given id from the database.
    upload              Upload a dataset.
    describe            Describe a dataset for a given tog id.
    stats               Get tagged/untagged points for a given tog id.

optional arguments:
  -h, --help            show this help message and exit
  -v                    Increase verbosity.
```
The download subcommand fetches datasets from either the database or our dvc repository. Both require authentication, since these datasets are private.
```
> skit-labels download -h
usage: skit-labels download [-h] {db,dvc} ...

positional arguments:
  {db,dvc}
    db          Download a dataset of a given id from the database.
    dvc         Download a dataset from a dvc enabled repo.

optional arguments:
  -h, --help    show this help message and exit
```
Since we can download datasets from both tog and dvc, there is a further subcommand for each source.
```
> skit-labels download tog -h
usage: skit-labels download tog [-h] -j JOB_ID [-o {.csv,.sqlite}] [-tz TIMEZONE]
                                [--batch-size BATCH_SIZE] [--full]
                                [-tt {conversation,simulated_call,audio_segment,dict,call_transcription,data_generation}]
                                [--start-date START_DATE] [--end-date END_DATE]

optional arguments:
  -h, --help            show this help message and exit
  -j JOB_ID, --job-id JOB_ID
                        Id of the tog dataset that we want to download. (default: None)
  -o {.csv,.sqlite}, --output-format {.csv,.sqlite}
                        Store dataset in supported formats. (default: .csv)
  -tz TIMEZONE, --timezone TIMEZONE
                        Timezone to parse datetime values. Like 'America/Los_Angeles',
                        'Asia/Kolkata' etc. (default: UTC)
  --batch-size BATCH_SIZE
                        Number of items to download in a batch. (default: 500)
  --full                If provided, download all data instead of including untagged
                        datapoints. (default: False)
  -tt {conversation,simulated_call,audio_segment,dict,call_transcription,data_generation},
  --task-type {conversation,simulated_call,audio_segment,dict,call_transcription,data_generation}
                        Task type for deserialization. (default: conversation)
  --start-date START_DATE
                        Filter items added to the dataset after this date. (inclusive)
                        (default: None)
  --end-date END_DATE   Filter items added to the dataset before this date. (exclusive)
                        (default: None)
```
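Per the help text above, --start-date is inclusive and --end-date is exclusive, i.e. the two together form a half-open date window. A minimal sketch of that predicate (the function name is hypothetical, for illustration only):

```python
from datetime import date

def in_download_window(added_on, start_date=None, end_date=None):
    """Half-open window: start inclusive, end exclusive; None means unbounded."""
    if start_date is not None and added_on < start_date:
        return False
    if end_date is not None and added_on >= end_date:
        return False
    return True

# Items added on the start date are kept:
print(in_download_window(date(2022, 1, 1),
                         start_date=date(2022, 1, 1),
                         end_date=date(2022, 1, 31)))  # True
# Items added on the end date itself are excluded:
print(in_download_window(date(2022, 1, 31),
                         start_date=date(2022, 1, 1),
                         end_date=date(2022, 1, 31)))  # False
```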
```
> skit-labels download dvc -h
usage: skit-labels download dvc [-h] --repo REPO --path PATH [--remote REMOTE]

optional arguments:
  -h, --help       show this help message and exit
  --repo REPO      DVC enabled git repository. (default: None)
  --path PATH      Path to the dataset. (default: None)
  --remote REMOTE  Remote. Required only if the repo hasn't set a default remote.
                   This is usually a bucket name. (default: None)
```
We can describe a dataset on the tog db using the following command.
```
> skit-labels describe -h
usage: skit-labels describe [-h] [--job-id JOB_ID]

optional arguments:
  -h, --help       show this help message and exit
  --job-id JOB_ID  Id of the tog dataset that we want to describe.
```
To count the data points that are tagged, untagged, skipped, etc., use the stats command.
```
> skit-labels stats -h
usage: skit-labels stats [-h] [--job-id JOB_ID]

optional arguments:
  -h, --help       show this help message and exit
  --job-id JOB_ID  Check the state of the dataset i.e. tagged, untagged and
                   pending data points for a given job-id.
```

To upload a dataset for annotation, use the upload subcommand.

```
> skit-labels upload tog -h
usage: skit-labels upload tog [-h] -j JOB_ID [--url URL] [--token TOKEN] [-i INPUT]

optional arguments:
  -h, --help            show this help message and exit
  -j JOB_ID, --job-id JOB_ID
                        Dataset id where the data should be uploaded. (default: None)
  --url URL             URL of the dataset server. Optionally set the
                        DATASET_SERVER_URL environment variable. (default: None)
  --token TOKEN         The organization authentication token. (default: fake_access_token)
  -i INPUT, --input INPUT
                        The raw data to be uploaded. (default: None)
```
Download a dataset from tog:

```shell
skit-labels download tog --job-id=61 --output-format=.csv --task-type conversation
```
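Once a job has been downloaded as .csv, it can be inspected with the standard library. The sketch below fakes a tiny two-row download in memory; the column names (conversation_id, tag) are assumptions for illustration, so check the actual header of your CSV before relying on them.

```python
import csv
import io

# Hypothetical stand-in for a downloaded job CSV; the schema is assumed,
# not documented by skit-labels.
sample = io.StringIO(
    "conversation_id,tag\n"
    "1,confirm\n"
    "2,\n"
)

rows = list(csv.DictReader(sample))
# A row with an empty tag column is treated as untagged here.
tagged = [r for r in rows if r["tag"]]
print(len(rows), len(tagged))  # 2 1
```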
Upload a dataset to tog for annotation:

```shell
skit-labels -vvvvv upload tog -j <int> --token=<token> --url https://apigateway.vernacular.ai
```
If you have used the skit-auth command line tool, the token is saved in ~/.skit/config.json. In that case, the --token argument is optional. Note that the organization information is embedded within the token, and the upload will fail if an incorrect token is used.
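A script can reproduce the same fallback: use an explicitly passed token, else read the one skit-auth saved. This is only a sketch; the "token" key inside config.json is an assumption, so verify it against your actual file.

```python
import json
import tempfile
from pathlib import Path

def read_skit_token(config_path=None):
    """Return the saved auth token, or None if the config file is absent.

    The 'token' key name is an assumption about the config.json layout.
    """
    path = Path(config_path) if config_path else Path.home() / ".skit" / "config.json"
    if not path.exists():
        return None
    return json.loads(path.read_text()).get("token")

# Demo against a temporary file instead of the real ~/.skit/config.json:
with tempfile.TemporaryDirectory() as d:
    cfg = Path(d) / "config.json"
    cfg.write_text(json.dumps({"token": "fake_access_token"}))
    print(read_skit_token(cfg))  # fake_access_token
```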
Task type is an optional argument when downloading datasets from tog. It is needed if you want type validation; if you don't provide it, rows are assumed to be raw dictionary objects. The task types are:
- conversation [default]
- simulated_call
- audio_segment
- dict
- call_transcription
- data_generation
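Since the CLI rejects unknown task types, a wrapper script can validate the value up front. The helper below is hypothetical; the set of allowed values is taken from the skit-labels download tog -h output earlier in this document.

```python
# Allowed values, copied from the `skit-labels download tog -h` snapshot.
TASK_TYPES = {"conversation", "simulated_call", "audio_segment",
              "dict", "call_transcription", "data_generation"}

def validate_task_type(task_type="conversation"):
    """Return the task type if supported, else raise. Hypothetical helper."""
    if task_type not in TASK_TYPES:
        raise ValueError(f"Unknown task type: {task_type!r}")
    return task_type

print(validate_task_type())        # conversation
print(validate_task_type("dict"))  # dict
```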
conversation is the most common task type. It accepts data from skit-calls | skit-fixdf.
We built an interface to simulate conversation flows without actually deploying ML models. To generate NLU training data for a new client, we plan to simulate calls covering various situations and then voice over them. This has two benefits over our older method:
- We don't have to go through test calls twice (once for generating data, a second time for tagging).
- The simulator can define conditions and distributions for generating data, instead of relying on human callers, who provide very biased and mostly top-level intent data.
Call transcription is the activity of manually listening to calls and transcribing them. It is essential for training AI models and for designing conversation flows and bot prompts. A user-friendly UI lets annotators transcribe the maximum number of calls with minimum effort and reasonable accuracy.
The interface allows setting an intent and, optionally, entities. Once these are set, it allows recording audio repeatedly for rapid generation of data points. This dataset lacks the structure that a conversation task dataset has, for the very reason that there is no flow / ML model deployed to produce those values.
All these datasets may need some pre-processing before we use them for training.