With the end of the school year and a lack of assignments, I'll hopefully have some more time to write about my projects. I'll start with one of the projects on my home page: Graffiti Tiles. In this project, I developed a site that automatically displays the last 24 hours of graffiti 311 cases in San Francisco. I used the same infrastructure as my 911 call map project to support this website. The main platform that I used to accomplish my goal was Cloudflare Workers. Similar to the 911 call project, I created a worker to make fetches to a Socrata database, which the San Francisco Data Team updates every day with new 311 cases across many categories.
Some of the categories that this database contains are quite interesting, while others contain little to no interesting data. Many of the 311 calls are related to street closures or construction projects, which I think could be interesting subjects for a future project using other data sources.
So let's get down to the technical challenges of this project. Firstly, from the reliability and optimization side, I didn't want to create unnecessary calls to the Socrata API and make the SF data people mad at me, so the purpose of the worker was to create a pseudo-API with a /data and /refresh endpoint. The /data endpoint serves my data, and the /refresh endpoint actually queries the database with the crafted input:
https://data.sfgov.org/resource/vw6y-z8j6.geojson?$select=*&$where=requested_datetime >= '{datetime}T00:00:00' AND service_name IN('Graffiti Private','Graffiti Public','Graffiti') AND media_url IS NOT NULL&$order=requested_datetime DESC&$limit=1000
One of the challenges that comes from pulling data from a city-managed database is that I have no control over what CDN they use. In the case of 311 service tickets, they use two different CDNS: Cloudinary, and Verint Cloud Services. On Cloudinary, the ticket attachments come in the form of full URL .jpeg images. On verint, the images are of the type data:image/jpeg;base64. Unfortunately, there is no reliable way to parse this type of image through a worker, so that's an unsolved problem.
Cloudflare also helps us automate our /refresh hits, by making cron jobs easily configurable. Using wrangler, I was able to set a cron job to populate my KV data every morning at 2 AM MT. In hindsight, this was a costly mistake that I made in my first iteration of the project. After a day of smooth data feeding into my site with manual calls to /refresh, I noticed that my scheduled cron jobs were resolving with 200 codes on Cloudflare's side, but I wasn't seeing any data on my /data endpoint or my frontend. After looking at my logs in wrangler for this worker and the one I used in my 911 radar project, I confirmed that this issue was not related to any Socrata or Cloudflare firewalling.
I realized that I had missed one check that would confirm if this was a cron issue or not: actually looking at the Socrata endpoint manually, without piping it through my own worker. I tried one of the endpoints that my cron job had just queried with no data, and honed my analysis in on the timestamp field. My cron job was fetching data from the past 24 hours at 2 AM UTC. When I looked at the Socrata database entry changelog, I noticed that updates were occurring at around 4 AM MT, with a few minutes of deviation. So that was the root cause of my issue-- my worker was not getting any data because it was querying before the data was updated.
I updated the cron job to follow the expression 0 12 * * * and voila, the following day I started pulling data automatically. Additionally, I offset my cron schedule by one hour to account for possible deviations due to daylight savings, as SF is in the PDT timezone for part of the year. This project taught me about cron scheduling, specifically troubleshooting cron misconfigurations and mismatches between what I can control and what is out of my hands.