
Triggers


What is first

What we will cover

This page describes how to use and create triggers in csv2rdf4lod-automation. Triggers encapsulate the replication/reproduction of a stage of conversion: retrieval, preparation, conversion (computation), and publication.

Let's get to it

1) Retrieval triggers

Retrieval triggers generate files in a conversion cockpit's source/ directory.

#!/bin/bash
#
#3> @prefix doap:    <http://usefulinc.com/ns/doap#> .
#3> @prefix dcterms: <http://purl.org/dc/terms/> .
#3> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
#3> 
#3> <> a conversion:RetrievalTrigger, doap:Project; # Could also be conversion:Idempotent;
#3>    dcterms:description 
#3>      "Script to retrieve and convert a new version of the dataset.";
#3>    rdfs:seeAlso 
#3>      <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Automated-creation-of-a-new-Versioned-Dataset>,
#3>      <https://github.com/timrdf/csv2rdf4lod-automation/wiki/tic-turtle-in-comments>;
#3> .
Global retrieval trigger

A retrieval trigger may be global, i.e. cross-version. In this case, the retrieval trigger also creates the conversion cockpit directory, and because it includes the logic to determine whether a new version should or should not be created, calling the retrieval trigger multiple times has no additional effect.
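A minimal sketch of a global retrieval trigger, assuming for illustration that the version identifier is simply the retrieval date (the real templates are described below):

#!/bin/bash
#
# Sketch of a global retrieval trigger; it lives in the dataset directory (above version/).
version=`date +%Y-%b-%d`
if [[ -d "version/$version" ]]; then
   echo "version/$version already exists; nothing to do."   # repeated calls have no effect
   exit 0
fi
mkdir -p "version/$version/source"    # creates the conversion cockpit
# ... retrieve into version/$version/source/ (e.g. with curl or wget) ...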

Local retrieval trigger

A retrieval trigger may apply only to a specific dataset version. This can apply when the source organization handles versioning themselves (e.g. sociam). The global retrieval trigger is much more popular than local retrieval triggers.

Templates for a retrieval trigger

cr-retrieve.sh and cr-create-versioned-dataset-dir.sh can both be used as templates for writing your own retrieval trigger. cr-retrieve.sh is newer than cr-create-versioned-dataset-dir.sh, and they perform different functions:

  • cr-create-versioned-dataset-dir.sh can only be run from cr:directory-of-versions or cr:conversion-cockpit directories, whereas cr-retrieve.sh can be run from anywhere within the data root (cr:data-root cr:source cr:dataset cr:directory-of-versions cr:conversion-cockpit).
  • cr-retrieve.sh accepts the argument --skip-if-exists to avoid retrieving if a version already exists (here), while cr-create-versioned-dataset-dir.sh will determine what the version identifier should be and quit if it's the same (here); see the example after this list.
  • cr-retrieve.sh will leverage DCAT access metadata if it is present (here), whereas cr-create-versioned-dataset-dir.sh requires the download URL as a command line argument (here).
    • cr-retrieve.sh handles Google Spreadsheet URLs specially (here); cr-create-versioned-dataset-dir.sh doesn't (and would require a whole new script, google2source.sh, for that handling).
    • cr-retrieve.sh relies upon cr-create-versioned-dataset-dir.sh (here) to retrieve non-Google Spreadsheet URLs in DCAT access metadata.
  • cr-retrieve.sh will defer to a custom retrieve.sh trigger if it's present (here).
  • cr-create-versioned-dataset-dir.sh will perform file-specific handling on anything that is retrieved (e.g. unzipping zips, csv-ifying XLS, tidying HTML to valid XML, DROIDing for file formats, etc.); cr-retrieve.sh does not.
  • cr-create-versioned-dataset-dir.sh will defer to a custom preparation trigger (here) if it exists.
  • cr-create-versioned-dataset-dir.sh will pull the general conversion trigger cr-convert.sh after retrieving and preparing (here).
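For example, cr-retrieve.sh can be pulled from anywhere within the data root and told to skip datasets whose versions already exist:

data/source$ cr-retrieve.sh --skip-if-exists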

If the retrieval involves querying other SPARQL endpoints, consider using cache-queries.sh so that you can capture the provenance of the query.
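A sketch of how that might look; the endpoint, the query file name, and the flags here are assumptions, so check cache-queries.sh's usage message for its exact interface:

# Hypothetical: cache the results of reverts.rq against a public endpoint,
# so the response lands in source/ along with the provenance of how it was queried.
cd source
cache-queries.sh http://dbpedia.org/sparql -q reverts.rq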

(This paragraph has been superseded by Secondary Derived Datasets' enabling mechanism.) Should the system-defined derived secondary datasets be enabled by creating their corresponding dataset directory in the data root (or, since git won't commit empty directories, by an enabled.txt file within that directory)? This would be much simpler than editing the cron job: the cron job could attempt everything, and each dataset would simply be skipped if its directory does not exist. Then, an --enable (or --force?) flag could be used to create the dataset even if the directory doesn't exist. If we adopt this design, we'll need to revisit a lot of the existing derived secondary datasets.

Remember to be polite when requesting:

#!/bin/bash
#
# Throttle and pause between requests so we don't hammer the source's server.
# Assumes $url (the URL to retrieve) and $here (the local file to write) are already set.

bps=$(($RANDOM % 2000 + 1))      # random rate limit between 1 and 2000 KB/s
echo "bps $bps"
curl --limit-rate ${bps}K -L "$url" > "$here"

sec=$(($RANDOM % 15))            # random pause of up to 14 seconds
echo "bps $bps; zzz $sec..."
sleep $sec

See also: throttling wget or curl.

Trigger crib sheets are here.

I often use this wget idiom to create a wget version, which can be considered an alias for "latest"; a tarball of that version can then be made for any particular archive date as another version.
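A rough sketch of the idea (the mirroring options, $url, and the snapshot layout here are assumptions; the linked idiom is authoritative):

# Refresh the "wget" version, which always mirrors the latest files:
mkdir -p version/wget/source
(cd version/wget/source && wget --mirror --no-parent "$url")

# Snapshot it into a dated version when an archived copy is needed:
mkdir -p "version/`date +%Y-%b-%d`/source"
tar czf "version/`date +%Y-%b-%d`/source/wget-snapshot.tar.gz" -C version/wget source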

2) Preparation triggers

A global prepare.sh (or, historically 2manual.sh) can encapsulate intermediate tweaks that may be required before conversion. Preparation triggers use files in source/ and generate files in manual/.

#!/bin/bash
#
#3> <> a conversion:PreparationTrigger; # Could also be conversion:Idempotent;
#3>    foaf:name "prepare.sh";
#3>    rdfs:seeAlso
#3>     <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Automated-creation-of-a-new-Versioned-Dataset>,
#3>     <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Triggers>,
#3>     <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Conversion-trigger>,
#3>     <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Conversion-cockpit>;
#3> .
#
# This script is responsible for processing files in source/ and storing their modified forms
# as files in the manual/ directory. These modified files should be ready for conversion.
# 
# This script is also responsible for constructing the conversion trigger
#   (e.g., with cr-create-conversion-trigger.sh -w manual/*.csv)
#
# When this script resides in a cr:directory-of-versions directory,
# (e.g. source/datahub-io/corpwatch/version)
# it is invoked by retrieve.sh (or cr-retrieve.sh).
#   (see https://github.com/timrdf/csv2rdf4lod-automation/wiki/Directory-Conventions)
#
# When this script is invoked, the conversion cockpit is the current working directory.
#
  • If source/*.xls exist, they will be converted to manual/*.csv because of this.
  • ../../src/html2csv.xsl will convert a source/*.html.tidy into manual/*.csv because of this (see the sketch below).
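For instance, the HTML case amounts to something like the following by hand (a sketch, assuming xsltproc is installed and that the tidied HTML is valid XML; "page" is a hypothetical file name):

xsltproc ../../src/html2csv.xsl source/page.html.tidy > manual/page.csv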

Mapping source paths to automatic paths:

for owl in `find source -name "*.owl"`; do
   turtle="automatic/${owl#source/}" && turtle="${turtle%.owl}.ttl"   # source/a/b.owl -> automatic/a/b.ttl
   # ... create "$turtle" from "$owl" ...
done

3) Computation triggers

#!/bin/bash
#
#3> <> a conversion:ConversionTrigger; # Could also be conversion:Idempotent;
#3>    foaf:name    "convert.sh";
#3>    rdfs:seeAlso <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Triggers#wiki-3-computation-triggers>;
#3> .
#

Types of computation:
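One concrete example of a computation (a sketch, not the project's canonical trigger: it assumes the rapper tool from Raptor is installed, and reuses the path-mapping idiom above to derive automatic/*.ttl from source/*.owl):

#!/bin/bash
# Derive Turtle in automatic/ from every OWL file in source/.
for owl in `find source -name "*.owl"`; do
   turtle="automatic/${owl#source/}" && turtle="${turtle%.owl}.ttl"
   mkdir -p "`dirname "$turtle"`"
   rapper -g -o turtle "$owl" > "$turtle"   # -g: guess the input syntax
done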

4) Publication triggers

#!/bin/bash
#
#3> <> a conversion:PublicationTrigger; # Could also be conversion:Idempotent;
#3>    foaf:name    "publish.sh";
#3>    rdfs:seeAlso <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Triggers#wiki-4-publication-triggers>;
#3> .
#

cr-publish.sh can be used to find and pull all publication triggers. It can be dry-run with the argument -n, and it can be told to pull only triggers that promise to be idempotent with the argument --idempotent.

data/source$ cr-publish.sh -n --idempotent

Trigger argument 'clean'

if [[ "$1" == 'clean' ]]; then
   # Remove what this trigger previously retrieved/derived,
   # echoing each command before running it.
   echo rm source/reverts.rq.sparql*
        rm source/reverts.rq.sparql*
   if [[ "$2" == 'all' ]]; then
      echo rm automatic/*
           rm automatic/*
   fi
   exit
fi
  • ./retrieve.sh clean removes source/, automatic/
  • ./prepare.sh clean does not remove source/, but removes automatic/
  • ./convert.sh clean
  • ./publish.sh clean

See also: xargs cheat sheet.

Parallelize with recursive calls via xargs

Pattern to parallelize processing by calling ourselves recursively:

if [[ -e "$1" ]]; then
   # Worker mode: the arguments are the files to process.
   while [[ $# -gt 0 ]]; do
      json="$1" && shift
      ttl="$json.ttl"
      if [[ ! -e "$ttl" ]]; then
         mkdir -p "`dirname "$ttl"`"
         # ... convert "$json" into "$ttl" here ...
      fi
   done
   exit
else
   # Dispatcher mode: find the inputs and call ourselves once per file,
   # CSV2RDF4LOD_CONCURRENCY at a time.
   # https://github.com/timrdf/csv2rdf4lod-automation/wiki/Triggers#parallelize-with-recursive-calls-via-xargs
   find -L automatic -name '*.json' -print0 | \
      xargs -0 -n 1 -P ${CSV2RDF4LOD_CONCURRENCY:-1} -I json $0 json
fi

dryruns

[[ "$1" == '-n' ]] && dryrun='yes' && shift || dryrun=''
if [[ -n "$dryrun" ]]; then
   echo "this is a dryrun!"
fi
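The flag can then guard anything destructive later in the trigger; a minimal sketch, reusing the automatic/ cleanup from the 'clean' example above:

if [[ -z "$dryrun" ]]; then
   rm -rf automatic/*
else
   echo "(dryrun) would remove automatic/*"
fi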

What is next
