March 20, 2020 - John Law
Recently, the Centre for Health Protection of the Hong Kong Government published a list of flights, trains, ships and vehicles taken by confirmed or probable cases of COVID-19. The first thing that came to my mind was: why not turn the data into .json and have it ready for integration or fetching?
This is a development note on the backend that transforms the published list into .txt and finally into .json. As I have a server for the data, I can also provide a history of the flights taken by confirmed patients as another option.
I attack the problem with two tools: a crawler and a transformer. Three types of file are mentioned above, and these two tools handle the intermediate operations between them.
This is no different from any other crawler. I start by requesting the document and reading it. After that, I immediately store it in raw.txt for debugging and logging. The transformation takes place here: I use a set of regular expressions to clean the raw data. Again, the result is stored in another .txt file for logging. In addition, I maintain a latest.txt for the next step. One could also apply tricks such as placing a marker character on the first or last line and renaming files with a timestamp. After these operations, the crawling is done.
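The cleaning and storing steps above can be sketched roughly as follows. This is only an illustration in Python, not the code running on my server: the function names, the timestamped file name and the cleaning rules (collapsing whitespace, dropping empty lines) are all assumptions, and the request step is left out.

```python
import re
from datetime import datetime, timezone
from pathlib import Path


def clean_raw_text(raw: str) -> str:
    """Clean raw text with regular expressions (hypothetical rules)."""
    lines = []
    for line in raw.splitlines():
        # Collapse runs of spaces/tabs into one space, then trim both ends.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # drop lines that are empty after cleaning
            lines.append(line)
    return "\n".join(lines)


def store_snapshot(raw: str, out_dir: Path) -> Path:
    """Keep raw and cleaned copies for logging, and refresh latest.txt."""
    out_dir.mkdir(parents=True, exist_ok=True)
    cleaned = clean_raw_text(raw)

    # Raw copy for debugging and logging.
    (out_dir / "raw.txt").write_text(raw, encoding="utf-8")

    # Timestamped cleaned copy for logging (the naming scheme is an assumption).
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    (out_dir / f"cleaned-{stamp}.txt").write_text(cleaned, encoding="utf-8")

    # latest.txt is what the next step reads.
    latest = out_dir / "latest.txt"
    latest.write_text(cleaned, encoding="utf-8")
    return latest
```

Keeping the cleaning in its own function means the regular expressions can be tried on a saved raw.txt without requesting the document again.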
For the .json transformation, it is generally the same. The server reads latest.txt and starts analyzing it. I turn the text file into an array so that it can be easily iterated. The most crucial thing for spotting an entry is the flight code, such as CX320 or BA0027. These codes are the indicators of a row. Once I have this information, everything else is easy for real-time fetching. For history.json, which is a unique history record of latest.json, I just use a naïve search to check whether an entry already exists. One extra note: I could have used a database for the history, but a plain file is just fine for 100 elements, you know.
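Here is a minimal sketch of the row detection and the naïve history check, again in Python and under assumptions: the flight-code pattern (two letters followed by one to four digits), the row layout with a "details" list, and the function names are all hypothetical, and the sample lines in the demo are placeholders rather than real data.

```python
import json
import re

# Assumed pattern: two uppercase letters then 1-4 digits, e.g. CX320 or BA0027.
# Airline designators that contain a digit are not handled by this sketch.
FLIGHT_CODE = re.compile(r"^[A-Z]{2}\d{1,4}$")


def parse_rows(lines):
    """Group lines into rows; a line that is a flight code starts a new row."""
    rows = []
    for line in lines:
        if FLIGHT_CODE.match(line):
            rows.append({"flight": line, "details": []})
        elif rows:
            # Any other line belongs to the most recent flight code.
            rows[-1]["details"].append(line)
        # Lines before the first flight code are ignored.
    return rows


def merge_history(history, latest):
    """Append entries that history does not contain yet (linear search)."""
    for entry in latest:
        if entry not in history:
            history.append(entry)
    return history


if __name__ == "__main__":
    # Placeholder lines standing in for the content of latest.txt.
    sample = ["CX320", "detail line A", "BA0027", "detail line B"]
    latest = parse_rows(sample)
    history = merge_history([], latest)
    print(json.dumps(history, indent=2))
```

The `entry not in history` check compares every new entry against the whole list, which is quadratic overall but perfectly fine at around 100 elements.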
They look good, but this is a project I really want to archive - this isn't good news. Stay healthy!
This is John Law, signing off.
Copyright © 2020 John Law