Parallel processing #101

Open
mgvarley opened this issue Dec 9, 2019 · 1 comment


mgvarley commented Dec 9, 2019

Thanks for this excellent library. I have successfully used it to transform 40m records from Postgres to DynamoDB (I see there is an open issue for this, so I will work on a PR). I am now working on a new ETL pipeline that takes a CSV, enriches each record with a call to a web service, and then generates a new CSV with a line for each record. The file is huge (22m lines), and I need to make multiple service calls in parallel (~40) for this to be efficient. I think I need to use the cluster/worker model, but I can't work out how to do that with this library. Would it be possible to add an example of how this should work? Many thanks.

mgvarley (Author) commented

I finally got this working without the cluster module. It could be optimised further, but it works well for our needs. Here is the code in case it helps anyone (I am using fast-csv for the CSV formatting):

const etl = require('etl')
const csv = require('fast-csv')
const _ = require('lodash')

// FILE_IN, FILE_OUT, PARALLEL, LOG_EVERY and myapi are configured elsewhere
let counter = 0

etl.file(FILE_IN)
  .pipe(etl.csv())                      // parse the incoming CSV into row objects
  .pipe(etl.collect(PARALLEL))          // group rows into batches of PARALLEL
  .pipe(etl.map(async function(docs) {
    // enrich each batch with concurrent API calls
    await Promise.all(docs.map(async (doc) => {
      const { id, params } = doc
      const res = await myapi.call(params)
      counter++
      if (counter % LOG_EVERY === 0) console.log(`${counter} rows processed`)
      this.push(_.extend({ id }, res[0]))
    }))
  }))
  .pipe(csv.format({ headers: true }))  // format enriched rows back to CSV
  .pipe(etl.toFile(FILE_OUT))
  .promise()
  .then(() => {
    console.log('done')
  })
  .catch(e => {
    console.error(e)
  })
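
A note on how this achieves parallelism: etl.collect(PARALLEL) groups incoming rows into arrays of PARALLEL items, and Promise.all fires one API call per row in the batch concurrently, so roughly PARALLEL requests are in flight per batch. One possible refinement, sketched below as a drop-in replacement for the etl.map stage above (it reuses myapi and the other names from the snippet, which are assumed to be defined elsewhere): wrapping each call in a try/catch so a single failed lookup logs and skips that row instead of rejecting the whole batch and tearing down the stream.

  .pipe(etl.map(async function(docs) {
    await Promise.all(docs.map(async (doc) => {
      const { id, params } = doc
      try {
        const res = await myapi.call(params)
        this.push(_.extend({ id }, res[0]))
      } catch (err) {
        // log and skip this row rather than failing the whole batch
        console.error(`row ${id} failed: ${err.message}`)
      }
    }))
  }))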
