As a result, we decided to stop doing things manually and build an automated pipeline that fetches, checks and loads data into GrantNav on a daily basis. We also decided to make access to a relational database that can be queried using SQL — a standard language that many data scientists and researchers know how to use.
Our automated pipeline consists of three parts:
- a data getter which fetches the list from the registry, downloads published spreadsheets, and converts into JSON
- a data tester which validates, verifies, reports and adds metadata
- a data store which generates the latest valid dataset.
Automating loads
The pipeline is deliberately designed to be resilient.
We call the package the “latest valid dataset” because the datastore can automatically resolve any data integrity problems for up to a month, allowing 360Giving time to work with funders to resolve any issues.
As 360Giving data is published and distributed across funder’s websites, we’ve built in safeguards in case data becomes unavailable. If a website goes down or the page gets moved we are able to fill in the gaps by checking if we had the data before, and replacing the blank record with the last known good one.
We also built the pipeline to make maintenance more flexible. As GrantNav and the 360Giving data store are now not dependent on each other, we can make changes to one without affecting the others’ users.
Providing access to data
The datastore provides access to 360Giving data in the form of a relational database that is updated automatically on a daily basis. This work has massively decreased the path to building tools on top of 360Giving data, and means users are now able to query in more ways.
This ease of access meant that when 360Giving built their Covid grants tracker, fetching the data that powers it was as easy as writing an SQL query for a set of keywords, and running that query every morning after the datastore has updated. This allows for daily updates as new data is published. As well as providing data for GrantNav and COVID tracker, the datastore connects to Google Colab notebooks that we’ve developed to help prototype new tools and explore the data set.
What comes next
Working closely with 360Giving to understand their needs helped us to build resilient infrastructure that allows people to use data quickly and easily. This new automated pipeline both allows researchers better access to data, and frees up our time and space to make the data more useful for them.
In the future, we’ll keep working to understand 360Giving’s needs to improve and iterate on these systems, making 360Giving data more useful, usable and in use.
Find out more about Open Data Services Co-operative. If you’d like to get in touch, you can contact us at opendataservices.coop