How We Built a Serverless PDF Processing Pipeline for Tax Season

This article is the first in a series covering IFF’s attempts at automating common recurring tasks for nonprofit organisations operating in India.

Table of Contents

Backstory
Constraints
The Pipeline
Some Gotchas
What’s Missing
Why Share This?

Backstory

It’s May, and it’s tax season. At the Internet Freedom Foundation (IFF), this is the time of year when we prepare to send out 10BE certificates to our donors.

A 10BE certificate is a document that proves a legit donation was made to an eligible nonprofit. Donors can use it to claim tax deductions. It’s important for us to ensure that a certificate reaches every single person who donated to us in a given financial year.

When we had fewer donors earlier, we sent these out manually. It took time, but it worked.

We now deal with hundreds, sometimes thousands, of these certificates. Each one needs to go to the right person. And they all come from a government portal that has no API, at least none that we know of or could easily get access to.

It was a perfect fit for automation.

Constraints

Before diving into the pipeline, we defined the constraints:

The PDFs are sensitive. No third-party processing or uploading to places we don’t have control over.
No manual intervention.
We already use AWS. We stick to AWS.
Serverless where possible.
Retain no information we don’t strictly need.

That gave us a narrow sandbox to build in. Which was good. Fewer decisions to make.

The Pipeline

Here’s what we came up with:

10BE pipeline on AWS

We upload the PDFs to S3. They come directly from the Income Tax website. There’s nothing interesting in the metadata. The filename contains the ARN (or a unique ID) along with the name of the individual, so we could extract and use those if needed.
A Lambda function is triggered asynchronously. The moment a PDF lands in the bucket, the Lambda receives an event object with details such as the filename and the type of operation (PUT, DELETE, and so on).
The Lambda extracts text from the PDF. Nothing fancy here; PDF parsing is a solved problem. We used an open-source Python library (pypdf). Each 10BE certificate follows the same structure, so parsing is straightforward. We’re only interested in the donor’s PAN field, so we extract that and drop everything else.
It matches the extracted data against a DynamoDB table. We upload a CSV of donor records to DynamoDB beforehand. The Lambda queries this table to find the right recipient. The CSV is a simple file that pairs each donor’s email address with the corresponding PAN.
It sends the PDF using SES. Once matched, the certificate goes out via email.
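The event handling, PAN extraction, and lookup steps above can be sketched roughly as follows. This is a minimal illustration, not the code IFF runs: the function names, the `pan` and `email` attribute names, the use of PAN as the table key, and the idea that the certificate text may also contain the organisation’s own PAN (which then has to be excluded) are all assumptions. The sketch relies only on the shape of S3 event records and on the PAN format of five letters, four digits, and one letter.

```python
import re
from urllib.parse import unquote_plus

# PAN format: five letters, four digits, one letter (e.g. ABCDE1234F).
PAN_PATTERN = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")


def uploaded_objects(event):
    """Yield (bucket, key) for every object-created record in an S3 event."""
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue  # ignore deletes and other operations
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event payloads (spaces arrive as "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key


def extract_donor_pan(text, own_pan):
    """Return the single PAN in the text that is not the organisation's own.

    `own_pan` is a hypothetical parameter: the organisation's PAN, excluded
    in case the certificate prints it alongside the donor's.
    """
    candidates = set(PAN_PATTERN.findall(text)) - {own_pan}
    if len(candidates) != 1:
        # Fail loudly instead of guessing; the PAN itself is never logged.
        raise ValueError(f"expected exactly one donor PAN, found {len(candidates)}")
    return candidates.pop()


def find_donor_email(table, pan):
    """Look up the donor's email; `table` is assumed to be a boto3 DynamoDB Table
    resource whose partition key is a hypothetical `pan` attribute."""
    item = table.get_item(Key={"pan": pan}).get("Item")
    if item is None:
        raise LookupError("no donor record for the extracted PAN")
    return item["email"]
```

The text passed to `extract_donor_pan` would come from pypdf, for example by joining `page.extract_text()` across the pages of a `PdfReader`. Raising on a missing or ambiguous match, rather than guessing, means a certificate is never emailed to the wrong person.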

If any of the above steps fails, the event is sent to a dead-letter queue, which we can look into manually later.
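A dead-letter queue is a function-level setting, as is the concurrency cap discussed under Some Gotchas. The sketch below shows one way to apply both with a boto3 Lambda client. The helper name and its parameters are hypothetical, and the article does not say how IFF actually configures these (it could just as well be done in the console or through infrastructure-as-code).

```python
def configure_function(lambda_client, function_name, dlq_arn, max_concurrency):
    """Attach a dead-letter queue and cap concurrency for one Lambda function.

    `lambda_client` is assumed to be a boto3 Lambda client.
    """
    # Async events that still fail after Lambda's built-in retries
    # are delivered to this SQS queue or SNS topic.
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        DeadLetterConfig={"TargetArn": dlq_arn},
    )
    # Reserved concurrency limits how many instances run at the same time.
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=max_concurrency,
    )
```

It would be called once at setup time, for example as `configure_function(boto3.client("lambda"), "send-10be-certificate", queue_arn, 5)`, where the function name and the limit of 5 are placeholders. Note that Lambda retries a failed asynchronous invocation twice by default before handing the event to the dead-letter queue.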

We also send the result to DynamoDB and CMS for record keeping, but that’s out of scope for this workflow.
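Although record keeping is outside this walkthrough, the retention limits described under Some Gotchas attach naturally to it. Here is a minimal sketch, assuming an audit table with TTL enabled on a hypothetical `expires_at` attribute and a bucket that holds only certificates. The attribute names, the rule ID, and the 90-day approximation of three months are illustrative choices, not IFF’s schema.

```python
import time

AUDIT_RETENTION_DAYS = 90  # an approximation of the three months used for audit records


def audit_item(certificate_id, status, now=None):
    """Build an audit record that DynamoDB can expire on its own."""
    now = int(time.time()) if now is None else int(now)
    return {
        "certificate_id": certificate_id,
        "status": status,
        "processed_at": now,
        # DynamoDB TTL expects a Number attribute holding epoch seconds.
        "expires_at": now + AUDIT_RETENTION_DAYS * 24 * 60 * 60,
    }


def certificate_lifecycle_rule(days=1):
    """Lifecycle configuration that expires uploaded PDFs after `days` days."""
    return {
        "Rules": [
            {
                "ID": "expire-certificates",  # hypothetical rule name
                "Filter": {"Prefix": ""},  # apply to every object in the bucket
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }
```

The item would be written with `table.put_item(Item=audit_item(...))`, and the rule applied with `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=certificate_lifecycle_rule())`. Neither mechanism is exact to the hour: S3 expiration works in whole days and DynamoDB removes expired items in the background, so both are better treated as upper bounds on retention than as precise timers.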

That’s it.

There’s no infrastructure to provision. No queues in the main path. No retry logic to write. Just drop the PDFs into S3 and watch the emails go out.

Some Gotchas

If you upload 1000 PDFs at once, AWS is happy to spin up 1000 Lambdas. Your downstream systems might not be. To keep things sane, we use reserved concurrency. It limits how many Lambda instances can run at the same time. In our case, that’s just enough to process certificates without overwhelming our CMS. You could put an SQS queue in between instead, but that seemed like overkill for our use case.
Sending emails with attachments using SES (boto SDK) is a PITA. We used Python’s built-in email library to construct the message body in the format SES expects. I wish this were simpler. AWS definitely has the resources to fix this, and I hope it does.
Don’t forget to set TTLs and lifecycle rules on your DynamoDB records and S3 objects (the certificates). We set a TTL of 3 months on audit records, and the actual PDFs get wiped after 24 hours! Don’t store anything that’s not strictly required. Keep it simple.
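To make the SES point concrete, here is a minimal sketch of building an email with a PDF attachment using Python’s standard `email` package and handing it to SES as raw MIME. The function names, subject line, and body text are placeholders, and `ses_client` is assumed to be a boto3 SES client; this is an illustration of the approach, not IFF’s actual code.

```python
from email.message import EmailMessage


def build_certificate_email(sender, recipient, pdf_bytes, filename):
    """Build a MIME message with the certificate attached as a PDF."""
    message = EmailMessage()
    message["Subject"] = "Your 10BE donation certificate"  # placeholder text
    message["From"] = sender
    message["To"] = recipient
    message.set_content("Please find your 10BE certificate attached.")
    # Adding an attachment turns the message into multipart/mixed.
    message.add_attachment(
        pdf_bytes, maintype="application", subtype="pdf", filename=filename
    )
    return message


def send_certificate(ses_client, message):
    """Send the message as raw MIME; `ses_client` is assumed to be a boto3 SES client."""
    return ses_client.send_raw_email(
        Source=message["From"],
        Destinations=[message["To"]],
        RawMessage={"Data": message.as_bytes()},
    )
```

Keeping the message construction separate from the send call makes the MIME part easy to test locally without touching AWS.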

What’s Missing

There’s still a lot left:

Uploading the 80G certificate (a prerequisite for 10BE) to the ITR website
Waiting for approval from ITR and downloading the result from the ITR portal
Connecting all of that to the existing pipeline

We’ll try to solve some of that next. As an NPO, we are constrained on resources. Time spent on paperwork and bureaucracy is time taken away from the real, meaningful work our donors trusted us with in the first place.

We know there are probably a myriad of commercial tools that could do parts of this, or even all of it. But we believe other small organisations might have the same problem: budget constraints and high-risk requirements.

Could this have been a Python script? Sure. We actually tried that first, and it worked well, to be honest. But in the long run, making it usable by non-technical staff was also one of our requirements. Right now, it’s just upload and forget.

This is our small attempt at helping orgs automate their workflows.

Link to repo.

If you’ve built something similar, or want to, feel free to reach out.