Downloading Scientific Datasets into TOPs (Part 02)
In this post, we’re going to bring in shapefiles representing the daily median Arctic sea ice extent for the years 1981-2010. I’ve packaged this one as a little HDA for tidiness, taking one parameter: the location where we want to save our shapefiles.
Inside of the HDA, there aren’t many nodes. Of course, we have a scheduler, and since we’re doing a relatively lightweight task that could benefit from being run on many cores, I’ve changed the “Total Slots” parameter to “Equal to CPU Count Less One”. After that, make sure that your TOP nodes are using your bundled scheduler instead of the default one. One way to do this is by going to the Schedulers tab of each TOP node and plugging your custom scheduler into the “TOP Scheduler Override”.
Besides the scheduler, there are three other nodes of note at the TOP level: a Python Processor, which downloads the data; a SOP Network, which processes the data into geometry; and a ROP Geometry Output, which caches out the generated geometry.
Downloading our Data with Python
Python Processors are handy ways to build custom TOP nodes. They require a bit more manual intervention than your typical TOP node, however, as we need to tell the node ourselves how to create work items. The Python Processor has different tabs for scripts that run at each step of the cook. For this one, we want to edit two tabs: Generate, which runs at the beginning and creates work items, and Cook Task, which runs for each generated work item.
Generate
When you first put down the Python Processor, Generate is pre-filled with a for loop that creates one work item per input work item.
for upstream_item in upstream_items:
    new_item = item_holder.addWorkItem(parent=upstream_item)
We want our system to work without needing input work items. We can create work items from scratch here in the Python Processor, so we’ll just delete those two lines.
Create Folders
I said that we want to create work items here, but there’s one other thing we should take care of at the same time: making sure the folder we want to save our shapefiles to exists, and creating it if it doesn’t. It’s better to run this code in Generate than in Cook Task, because Generate runs once at the beginning of the cook, while Cook Task runs once for every work item, and we only need to check the folder once. Thus, we’ll break our Generate code into two sections. First, let’s write the logic for the folder creation. Your code here may differ; at the end of the day, base_path simply needs to be a string holding the path of the folder where you want to save your shapefiles.
import os.path
import hou
# Fetch $GEO
geo = hou.getenv("GEO")
# Make sure the parent folder for the data exists.
base_path = os.path.join(geo, "NSIDC/Sea_Ice_Index/shapefiles")
if not os.path.exists(base_path):
    os.makedirs(base_path)
I personally use a system environment variable named GEO to set the folder where caches for this project go. All $GEO holds is a file path, in this case /cg/Houdini Projects/Climate/GEO. I import hou and call hou.getenv() to fetch the value of the $GEO variable. I then use os.path.join to combine the second half of the file path with the first, so the full final file path becomes /cg/Houdini Projects/Climate/GEO/NSIDC/Sea_Ice_Index/shapefiles. At the end of the day, you just need to set base_path to the path of your folder. If that means replacing the block of code from “base_path = [...]” up with “base_path = "/path/to/your/folder"”, then so be it. I use a Python program to manage my projects, which is why I have some added complexity.
Now that we have our base_path, we can use the last two lines, the if-statement, to create the folder if it doesn’t exist. os.path.exists returns True or False depending on whether the path exists. If it doesn’t exist, os.makedirs(base_path) creates the directory (including any missing parent folders); otherwise, nothing happens.
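As an aside, Python 3’s os.makedirs can do the existence check itself via its exist_ok flag, collapsing the if-statement into a single line. A minimal sketch, using a temporary directory in place of $GEO so it runs outside Houdini:

```python
import os
import tempfile

# Stand-in for $GEO so this snippet is self-contained.
geo = tempfile.mkdtemp()
base_path = os.path.join(geo, "NSIDC/Sea_Ice_Index/shapefiles")

# exist_ok=True makes the call a no-op when the folder already exists,
# so no separate os.path.exists() check is needed.
os.makedirs(base_path, exist_ok=True)
os.makedirs(base_path, exist_ok=True)  # calling again is harmless
```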
Create Work Items
After the above code, we want to add the following code:
# Create 366 work items for the days of the year
for n in range(366):
    new_item = item_holder.addWorkItem(inProcess=True, index=n + 1)
“for n in range(x)” runs the code that follows x times. In this case, we run “new_item = item_holder.addWorkItem([...])” 366 times, enough for every day of the year including a leap day. The addWorkItem method on the provided item_holder object, as its name suggests, adds a new work item to the node. Note that we need to set the “index” attribute of each work item; this index (1-366) identifies which day of the year the work item represents.
At this point, cooking the node should create 366 green dots and a folder.
Cooking
Now we need to define the behavior of each work item when it cooks. The quick-and-dirty summary of what needs to happen here is that we need to download the data from the NOAA Data Portal. Problem is, it comes packed as ZIP files, and we only care about the .shp files inside. So we need our work item to download the zip, unzip it, delete the zip file, then register the remaining shapefile as the Output File of the work item. Since the Python snippets from here on get a little longer, I’m going to do less line-by-line description and more general description of chunks of code.
Lines 15 and 17 just set the base directory I want to save my data files into. Line 16 creates the urllib3 PoolManager that handles our web requests to the NOAA data server.
The next block of code fetches information about the work item currently being processed. We create a handful of variables representing the filename of the downloaded zip file, the path where that file will be saved, and the file path of the final unzipped shapefile.
The next block of code is nested inside an if block that checks whether a given shapefile has already been downloaded. The with block sends an HTTP request to the NOAA server for the file matching the current work item’s filename. The response of that request is then saved to the file defined in file_path. The shutil.unpack_archive() command then extracts the zip file.
The following for block runs through all the newly created files and deletes anything ending in the .zip extension, leaving behind only the .shp files.
The final line then registers the value of the shp_path variable as the “Output File” attribute of the work item, allowing us to easily pass the shapefile on to later TOP nodes.
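The chunks described above fit together roughly like this. To be clear, this is a hedged sketch, not the code from the node: the zip filename pattern and BASE_URL are made-up placeholders (check the NOAA Sea Ice Index portal for the real ones), stdlib urllib.request stands in for the urllib3 PoolManager so the snippet has no external dependencies, and the final addOutputFile() call is the pdg.WorkItem method recent Houdini builds use to set the Output File.

```python
import os
import shutil
import urllib.request

BASE_URL = "https://example.com/sea_ice/"  # placeholder, not the real NOAA URL

def day_to_zip_name(day_index):
    # Placeholder naming scheme; substitute NOAA's actual filename pattern.
    return "median_extent_day_%03d.zip" % day_index

def cook(work_item, base_path):
    # Filename of the zip for this work item's day, plus derived paths.
    zip_name = day_to_zip_name(work_item.index)
    file_path = os.path.join(base_path, zip_name)
    shp_path = file_path[:-len(".zip")] + ".shp"

    # Skip the download if this day's shapefile already exists on disk.
    if not os.path.exists(shp_path):
        # Stream the server's response straight into the local zip file.
        with urllib.request.urlopen(BASE_URL + zip_name) as resp, \
                open(file_path, "wb") as out:
            shutil.copyfileobj(resp, out)

        # Extract the archive next to the zip...
        shutil.unpack_archive(file_path, base_path)
        # ...then delete every leftover .zip, keeping only the .shp files.
        for name in os.listdir(base_path):
            if name.endswith(".zip"):
                os.remove(os.path.join(base_path, name))

    # Register the shapefile as the work item's "Output File".
    work_item.addOutputFile(shp_path)
```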
With that, we’ve acquired our data, so in the next one I’ll break down how to represent that data as Geometry in Houdini.