Jonathan Franklin

Downloading Scientific Datasets into TOPs (Part 01)

I’ve recently been fascinated with the data visualizations done by Zachary Labe (see here: https://zacklabe.com/arctic-sea-ice-figures/). On this site, he also details his methods and his data sources. This got me thinking about what could be done with this data in Houdini, and the first step of working with data is acquiring said data. I wanted to see how much of this process could be done inside Houdini itself. This article is merely about the acquisition of data, not about what you can do with it. That may come in a later post.

This post will come in two parts. First, we’ll use a Python script to download a simple CSV file and use the CSV Input node to parse the data. Second, we’ll write a more complex script to download a time series of shapefiles.

The data I’ll be using for this article is the NSIDC Sea Ice Index, Version 3 (a NOAA@NSIDC dataset), which can be found here: https://nsidc.org/data/g02135/versions/3

CSV Download and Parsing

Let’s start with downloading a single CSV file and separating its information into TOP attributes.

First, enter a TOPs context and put down a Python Script node. The Python Script node is handy when you need to run a straightforward Python task that doesn’t require the creation of multiple work items. In this case, we’re downloading a single CSV file, so we can avoid a more complex node like the Python Processor. We will come back to the Python Processor in the next section.

We will import one library into this script: requests, which we’ll use to make the HTTP request that downloads our target file:

import requests

Next, we will define where we want to save this file. I’ll define a string named file_path that holds the path I want to save this file to. This could be set up to point towards a parameter or another Houdini data type; in this case, I just hard-coded the location I want to put it.

file_path = '$GEO/NSIDC/Sea_Ice_Index/N_seaice_extent_daily_v3.0.csv'

Before we get too far, I’ll want to register our file_path variable as the expected output of our work item, and tag it as a file/csv type. This is how we can easily tell the next node which file to process without having to set file paths again.

work_item.addOutputFile(file_path, "file/csv")
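Conceptually, addOutputFile just records a (path, tag) pair on the work item, and downstream nodes can then discover files by their tag. Here’s a minimal plain-Python stand-in to illustrate the bookkeeping — this is not the real pdg API, just a sketch of the idea:

```python
# A minimal stand-in for PDG's output-file bookkeeping (NOT the real pdg API).
class FakeWorkItem:
    def __init__(self):
        self.output_files = []  # list of (path, tag) pairs

    def addOutputFile(self, path, tag=""):
        self.output_files.append((path, tag))

wi = FakeWorkItem()
wi.addOutputFile('$GEO/NSIDC/Sea_Ice_Index/N_seaice_extent_daily_v3.0.csv', "file/csv")

# A downstream node (like CSV Input) can then pick out files by tag:
csv_files = [path for path, tag in wi.output_files if tag == "file/csv"]
```

This tag-based filtering is why we don’t need to retype the file path in later nodes.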

Now that our path and work item are configured, we need to actually fetch the data and write it to disk. We’ll use the requests.get() method to download the file, but first we need to figure out where the file lives. On the NOAA webpage linked in the first section of this post, scroll down to the “Data Access and Tools” section and open the “NOAA@NSIDC HTTPS File System”. We’ll be led to a rather sparse page with a few links. You can take a peek around; this directory holds all the data files we’ll be working with. The specific dataset I’m using in this section is the daily extent table for the northern hemisphere, located under north/daily/data: https://noaadata.apps.nsidc.org/NOAA/G02135/north/daily/data/N_seaice_extent_daily_v3.0.csv

response = requests.get('https://noaadata.apps.nsidc.org/NOAA/G02135/north/daily/data/N_seaice_extent_daily_v3.0.csv')

With this, we’ve stored the HTTP response in the response variable; now we want to write its contents to disk. One catch: $GEO is a Houdini variable, and Python’s open() won’t expand it on its own, so we expand the path first with hou.text.expandString():

with open(hou.text.expandString(file_path), 'wb') as file:
    file.write(response.content)

The snippet above expands file_path into a real filesystem path, opens a new file at that location, and uses the write() method to write the content of our HTTP response to disk. Note that the target folder has to exist already; if it might not, create it first with os.makedirs().
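Opening the file in 'wb' (write binary) mode matters here: it writes the bytes exactly as received, with no newline translation or text re-encoding. A quick plain-Python illustration of the round trip, using a temp directory and made-up sample bytes:

```python
import os
import tempfile

# Sample bytes standing in for response.content; 'wb' writes them verbatim.
data = b"Year,Month,Day,Extent\n1978,10,26,10.231\n"
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "wb") as f:
    f.write(data)

# Reading back in binary mode returns the identical bytes.
with open(path, "rb") as f:
    round_trip = f.read()
```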

With this, our script should be complete. This node can now be dirtied and cooked.

Final script:

import requests
import hou  # for expanding Houdini variables like $GEO

file_path = '$GEO/NSIDC/Sea_Ice_Index/N_seaice_extent_daily_v3.0.csv'
work_item.addOutputFile(file_path, "file/csv")

response = requests.get('https://noaadata.apps.nsidc.org/NOAA/G02135/north/daily/data/N_seaice_extent_daily_v3.0.csv')
response.raise_for_status()  # stop the cook if the download failed

# Python's open() won't expand $GEO, so expand the path before writing
with open(hou.text.expandString(file_path), 'wb') as file:
    file.write(response.content)
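If you want the download step to be a little more defensive, it can be wrapped in a small helper. This is a sketch, not part of the original setup: the function name, the timeout value, and the injectable session argument are my own additions (the session parameter also lets you exercise the helper without a network connection):

```python
import os

def download_to(url, path, session=None, timeout=60):
    """Fetch url and write its bytes to path, raising on HTTP errors.

    session lets you pass a preconfigured requests.Session (or a stub
    object with a .get() method when testing offline).
    """
    if session is None:
        import requests  # Houdini's Python ships with requests
        session = requests
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # surface 404s instead of saving an error page
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)  # ensure the folder exists
    with open(path, "wb") as f:
        f.write(response.content)
    return path
```

In the TOP script, you would call something like download_to(url, hou.text.expandString(file_path)) after registering the output file.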

Parsing the CSV

Luckily, TOPs comes with a couple of built-in CSV nodes. The one we’ll focus on today is the CSV Input node, which reads a CSV file and splits its rows into work items, with attributes holding the values of each column. Put down the node and hook it up to the output of the Python Script node.

image01

If we set up our work item output correctly, we should be able to simply pass along the downloaded file without any extra setup. Some CSVs (including this one) have a header row, i.e. a row at the beginning with column titles instead of data. Make sure to check the Has Header Row box if that is the case. This is handy for us because Houdini will automatically name our attributes based on the titles in the header row.
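The header-row behavior is the same idea as Python’s csv.DictReader, which uses the first row as the keys for every following row. A rough plain-Python sketch with made-up rows shaped like the NSIDC table (the column names here are illustrative, not the exact NSIDC header):

```python
import csv
import io

# A tiny stand-in for the downloaded table; column names are illustrative.
sample = """Year, Month, Day, Extent
1978, 10, 26, 10.231
1978, 10, 28, 10.420
"""
reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)
rows = list(reader)

# Each row plays the role of one work item, with the header titles as keys --
# much like the CSV Input node naming attributes from the header row.
extent_of_first = rows[0]["Extent"]  # "10.231" (csv values arrive as strings)
```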

image02

If we want to use every column provided to us, we’ll want to set “Extraction” to “All”. Otherwise, we’ll need to use one of the other options and configure the multiparm to read only the columns we specifically want.
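Restricting extraction to specific columns is conceptually just keeping a subset of keys from each parsed row. A plain-Python sketch (again with illustrative column names):

```python
import csv
import io

sample = "Year, Month, Day, Extent\n1978, 10, 26, 10.231\n"
reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)

wanted = ["Year", "Extent"]  # stand-ins for the columns configured in the multiparm
subset = [{name: row[name] for name in wanted} for row in reader]
```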

image03

After cooking with these settings, we should now get a list of the data values for each column, for each work item:

image04

Great! Now we have our data in a format ready for some fun stuff in Houdini. I’ll save that for another post, however. In the next post (which will likely come out in the following days), we’ll build a somewhat more complex setup for automatically acquiring a series of shapefiles from the same data source using the Python Processor node.