
How We Built a Serverless PDF Processing Pipeline for Tax Season

Published:  at  07:26 PM

This article is the first in a series covering IFF’s attempts at automating common recurring tasks for nonprofit organisations operating in India.


Backstory

It’s May, and it’s tax season. At the Internet Freedom Foundation (IFF), this is the time of year when we prepare to send out 10BE certificates to our donors.

A 10BE certificate is a document that proves a legit donation was made to an eligible nonprofit. Donors can use it to claim tax rebates. It’s important for us to ensure that certificates are sent to every single person who donated to us for a given financial year.

When we had fewer donors earlier, we sent these out manually. It took time, but it worked.

We now deal with hundreds, sometimes thousands, of these certificates. Each one needs to go to the right person. And they all come from a government portal that has no API that we know of, or that we could easily get access to anyway.

It was a perfect fit for automation.

Constraints

Before diving into the pipeline, we defined the constraints:

That gave us a narrow sandbox to build in. Which was good. Fewer decisions to make.

The Pipeline

Here’s what we came up with:

10BE pipeline on AWS

  1. We upload the PDFs to S3. They come directly from the Income Tax website. There’s nothing interesting in the metadata. The filename contains the ARN (or a unique ID) along with the name of the individual, so we could use it if we extract it.

  2. A Lambda function is triggered asynchronously. The moment a PDF hits the bucket, the Lambda receives an event object containing details such as the filename and the type of operation (PUT, DELETE, etc.).

  3. The Lambda extracts text from the PDF. Nothing fancy here; PDF parsing is a solved problem. We used an open-source Python library (pypdf). Each 10BE certificate follows the same structure, so parsing is straightforward. We’re only interested in finding the PAN field corresponding to the donor, so we extract that and drop everything else.

  4. It matches the extracted data against a DynamoDB table. We upload a CSV of donor records to DynamoDB beforehand. The Lambda queries this table to find the right recipient. The CSV is a simple file with an EMAIL column and the corresponding PAN for each donor.

  5. It sends the PDF using SES. Once matched, the certificate goes out via email.
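
Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the code from our repo: the function names are made up for this post, and the `org_pan` parameter assumes the certificate also carries the organisation's own PAN, which has to be excluded. The PAN pattern itself (five letters, four digits, one letter) is the standard format.

```python
import io
import re
from urllib.parse import unquote_plus

# PAN format: five letters, four digits, one letter (e.g. ABCDE1234F).
PAN_PATTERN = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")


def uploaded_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for the objects created by this S3 event."""
    objects = []
    for record in event.get("Records", []):
        # Skip deletions and any other non-upload notification.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes keys in notifications (spaces arrive as "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def pdf_to_text(pdf_bytes: bytes) -> str:
    """Concatenate the text of every page using pypdf."""
    # Third-party dependency, imported lazily so the other helpers
    # in this sketch run without it.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_donor_pan(text: str, org_pan: str) -> str:
    """Return the single PAN in the text that is not the organisation's own.

    Assumption: besides the donor's PAN, the only other PAN-shaped token
    on the certificate is the organisation's PAN.
    """
    candidates = {pan for pan in PAN_PATTERN.findall(text) if pan != org_pan}
    if len(candidates) != 1:
        raise ValueError(f"expected exactly one donor PAN, found {len(candidates)}")
    return candidates.pop()
```

Refusing to guess when zero or several candidate PANs turn up is deliberate: a certificate sent to the wrong donor is worse than one that needs a manual look.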

If any of the above steps fails, the event is sent to a dead-letter queue, which we can look into manually later.
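
Here is a hedged sketch of the matching and sending steps, written so that a failure surfaces as an exception instead of being swallowed. For an asynchronous invocation, an unhandled exception is what marks the event as failed, and that is what lets Lambda route it to a configured dead-letter queue once it gives up. The function names, the subject line, and the injected `lookup_email` and `send_raw` callables are illustrative assumptions; in a real Lambda they would likely wrap boto3's DynamoDB `get_item` and SES `send_raw_email` calls.

```python
from email.message import EmailMessage
from typing import Callable, Optional


def build_certificate_email(sender: str, recipient: str,
                            pdf_bytes: bytes, filename: str) -> bytes:
    """Build a raw MIME message with the certificate attached as a PDF."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "Your 10BE donation certificate"  # placeholder wording
    message.set_content("Please find your 10BE certificate attached.")
    message.add_attachment(pdf_bytes, maintype="application",
                           subtype="pdf", filename=filename)
    return message.as_bytes()


def deliver_certificate(
    pan: str,
    pdf_bytes: bytes,
    filename: str,
    sender: str,
    lookup_email: Callable[[str], Optional[str]],
    send_raw: Callable[[str, str, bytes], None],
) -> str:
    """Look up the donor's email by PAN and send the certificate.

    Raises LookupError when no donor record matches, so the invocation
    fails visibly instead of silently dropping the certificate.
    """
    recipient = lookup_email(pan)
    if recipient is None:
        # Keep the PAN out of the error message; it may end up in logs.
        raise LookupError(f"no donor record for certificate {filename}")
    send_raw(sender, recipient,
             build_certificate_email(sender, recipient, pdf_bytes, filename))
    return recipient
```

Passing the lookup and send functions in as arguments keeps the logic testable without AWS credentials: a plain dictionary's `get` method can stand in for the donor table.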

We also send the result to DynamoDB and CMS for record keeping, but that’s out of scope for this workflow.

That’s it.

There’s no infrastructure to provision, and no queues or retry logic of our own to maintain. Just drop the PDFs into S3 and watch it send out emails.

Some Gotchas

What’s Missing

There’s still a lot left:

We’ll try to solve some of that next. As an NPO, we are constrained on resources. Time spent on paperwork and bureaucracy is time taken away from the real, meaningful work that our donors trusted us with in the first place.

Why Share This?

We know there are probably a myriad of commercial tools that could do parts of this, or even all of it. But we believe other small organizations might face the same problem: budget constraints and high-risk requirements.

Could this have been a Python script? Sure. We actually tried that at first, and it worked well, to be honest. But looking at the long run, making it friendly to non-technical people was also one of our requirements. Right now, it’s just upload and forget.

This is our small attempt at helping orgs automate their workflows.

Link to repo.

If you’ve built something similar, or want to, feel free to reach out.


