Retrieving Corona-Contaminated Flights from Official Document

March 20, 2020 - John Law - 2 mins read

Recently, the Centre for Health Protection of the Hong Kong Government published a list of flights, trains, ships and vehicles taken by confirmed or probable cases of COVID-19. Well, the first thing that came to my mind was that the data is in a .pdf. Wouldn't it be more useful if I transformed it into .json, ready for integration or fetching?

Work it through

This is a development note on the backend that transforms the .pdf to .txt and finally to .json. As I have a server for the data, I can also provide a history of the flights taken by confirmed patients as another option.

I attack the problem with two tools: a crawler and a transformer. Three file types are involved (.pdf, .txt and .json), and these two tools are responsible for the conversions between them.

The crawler is no different from any other crawler. I start by requesting the document and reading it. After that, I immediately store it in a raw.txt for debugging and logging. The cleaning takes place here: I use a bunch of regular expressions to strip noise from the raw data. Again, the result is stored in another .txt file for logging, and I also maintain a latest.txt for the next step. One could also perform tricks such as placing a marker character on the first/last line and renaming files with a timestamp. After these operations, the crawling is done.
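The cleaning and logging step could be sketched in Python roughly as follows. This is a minimal illustration, not the code from the repository: the cleaning rules, the `clean_raw` and `store` helpers, and the `clean-<timestamp>.txt` naming are all assumptions, and the PDF text is taken as already extracted into a string.

```python
import re
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def clean_raw(text: str) -> str:
    """Normalise extracted PDF text (hypothetical cleaning rules)."""
    # Unify line endings and treat page-break form feeds as newlines.
    text = text.replace("\r\n", "\n").replace("\f", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip every line and drop the ones that end up empty.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line) + "\n"


def store(raw: str, out_dir: Path, now: Optional[datetime] = None) -> Path:
    """Write raw.txt, a timestamped cleaned copy, and latest.txt."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    out_dir.mkdir(parents=True, exist_ok=True)

    # Keep the untouched text for debugging and logging.
    (out_dir / "raw.txt").write_text(raw, encoding="utf-8")

    cleaned = clean_raw(raw)
    # Timestamped copy for the log (assumed naming scheme).
    (out_dir / f"clean-{stamp}.txt").write_text(cleaned, encoding="utf-8")

    # latest.txt is what the next step reads.
    latest = out_dir / "latest.txt"
    latest.write_text(cleaned, encoding="utf-8")
    return latest
```

Calling `store(extracted_text, Path("data"))` would leave three files in `data/`. Keeping the raw copy means the regular expressions can be adjusted and re-run later without requesting the document again.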

For the .txt to .json transformation, the process is much the same. The server reads in latest.txt and starts analyzing it. I turn the text file into an array of lines so that it can be easily iterated. The most crucial thing to spot in each entry is the flight code, like CX320 or BA0027. These codes are the indicators of a row. Once I have this information, everything else is easy for real-time fetching. For history.json, which is a record of the unique entries seen in latest.json, I just use a naïve $O(n^2)$ search to check whether an entry already exists. One extra note: I could have used a database for the history, but this is just fine for 100 elements, you know.
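Here is a rough Python sketch of that idea, under stated assumptions. The real layout of latest.txt is not shown in this post, so the sketch assumes a row begins on a line whose first token is a flight code and that following lines belong to the same row. The `FLIGHT_CODE` pattern, the `flight` and `details` field names, and the `parse_rows` and `merge_history` helpers are all hypothetical.

```python
import json
import re
from typing import Dict, List

# Two-character airline designator containing at least one letter,
# followed by a 1-4 digit flight number (matches CX320 and BA0027).
FLIGHT_CODE = re.compile(r"(?:[A-Z]{2}|[A-Z]\d|\d[A-Z])\d{1,4}")


def parse_rows(text: str) -> List[Dict]:
    """Group the lines of the cleaned text into rows keyed by flight code."""
    rows: List[Dict] = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if FLIGHT_CODE.fullmatch(tokens[0]):
            # A flight code marks a new row.
            rows.append({"flight": tokens[0], "details": tokens[1:]})
        elif rows:
            # Anything else is attached to the row currently being built.
            rows[-1]["details"].extend(tokens)
        # Lines before the first flight code (titles, headers) are skipped.
    return rows


def merge_history(history: List[Dict], latest: List[Dict]) -> List[Dict]:
    """Append entries not yet in the history using a naive O(n^2) scan."""
    merged = list(history)
    for entry in latest:
        if not any(entry == seen for seen in merged):
            merged.append(entry)
    return merged


def to_json(rows: List[Dict]) -> str:
    """Serialise rows for latest.json or history.json."""
    return json.dumps(rows, indent=2, ensure_ascii=False)
```

Something like `merge_history(old_history, parse_rows(latest_text))` would then give the list to write back to history.json. The linear scan inside the loop is what makes it quadratic, which is harmless at around 100 entries.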

As for providing the data, I use a gist and a crontab. Again, I could have used another CI service or GitHub Actions. This may or may not be in my plans.

The results look good, but this is a project I really want to archive - this isn't good news. Stay healthy!

Links

coronaflight-history.json

coronaflight.json

coronaflight-hkg backend repository

This is John Law, signing off. You read 403 words.