Spatial Data¶
Because vector spatial data includes geometric objects and projection metadata, it generally can’t be stored easily in normal tabular formats like CSVs. Instead, spatial data is generally stored in one of two main formats: shapefiles and GeoJSON files. In this reading, we’ll talk briefly about these formats, and where to find spatial data!
Spatial Data Formats¶
Shapefiles¶
Shapefiles are a very old, very simple format, and are kind of like the CSVs of spatial data. They’re probably the most commonly passed around vector data format.
A shapefile actually consists of several files, all with the same name but different suffixes: for example, a shapefile of data on counties might consist of the following set of files: counties.shp, counties.dbf, counties.prj, counties.shx, etc. Every shapefile will contain a .shp file (and a complete one will also have .shx and .dbf files), but beyond that all bets are off on the number of files and the suffixes they will include.
To load a shapefile with geopandas, simply place all of the shapefile’s files in a single folder and point the gpd.read_file() function at the .shp file – geopandas will do the work of looking for other files with the same name in the same folder.
Similarly, you can write a GeoDataFrame (gdf) to a shapefile with the command gdf.to_file("my_shapefile.shp"). You’ll just find that more than one file has been created.
The one thing to be aware of about shapefiles is that they have some odd restrictions. For example, column names cannot be longer than 10 characters, so long column names will get truncated on write. And string values in your data cannot be longer than 254 characters.
GeoJSON¶
An increasingly popular newer format for spatial data is GeoJSON. Unlike shapefiles, GeoJSON files are a single file with a .geojson file suffix. Geopandas can read GeoJSON files with gpd.read_file() (same function as above – geopandas will check the file suffix to determine if the file being read is a shapefile or GeoJSON), and write them with gdf.to_file("my_geojson.geojson", driver="GeoJSON").
CSVs for Points¶
The one place where normal data formats may be used for spatial data is when dealing with points, since a point is fully specified by a single x-coordinate and a single y-coordinate. As a result, you may often find that point data comes to you in the form of a CSV. When you get this kind of data, the two coordinate columns of a normal dataset can be turned into a GeoDataFrame easily with the gpd.points_from_xy() function, something we’ll talk about in our readings on projections.
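As a preview, here is a minimal sketch of that conversion. The CSV contents and the column names (`city`, `longitude`, `latitude`) are invented for illustration, and the CRS is an assumption: I am treating the coordinates as ordinary WGS84 longitude/latitude (EPSG:4326), which you should confirm for any real dataset.

```python
import io

import geopandas as gpd
import pandas as pd

# Hypothetical CSV of point data; in practice you would pass a file path to pd.read_csv().
csv_text = """city,longitude,latitude
Durham,-78.8986,35.9940
Raleigh,-78.6382,35.7796
"""
df = pd.read_csv(io.StringIO(csv_text))

# x is longitude and y is latitude; swapping them is a very common mistake.
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
    crs="EPSG:4326",  # assumption: coordinates are WGS84 lon/lat
)
print(gdf)
```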
Other Formats¶
There are some other formats in the world, such as GeoPackages, which are also supported by geopandas. In general, though, I’ll admit I haven’t really seen them in the wild…
Finding Spatial Data¶
Below is some advice on finding spatial data for your own use. None of this is stuff you should memorize; rather, this is meant as a resource you can revisit if you need help in the future!
Google Keywords¶
My most valuable advice for finding spatial data is: include “shapefile” or “geojson” in your query. Seriously – it makes a HUGE difference in terms of the likelihood you will actually find data and not just a site that talks about your subject!
Quick and Easy: DIVA-GIS¶
Want some country administrative borders (or administrative borders within a country), satellite images, elevation data, or other data? Check out DIVA-GIS. I don’t know that their data is always as trustworthy as data from official sources, but for quick-and-dirty analyses, it’s a great resource.
Government Census Data¶
Government census data is often the underpinning of spatial analyses, because it’s available almost everywhere, is free, and has tons of information about… well, everyone!
The best resources for spatial census data are NHGIS (for US data) and IHGIS (for international data). These projects are run by the same folks, IPUMS, who we’ve gone to in the past for individual level census data in the US or internationally. They’re amazing. You go to their site, tell them the geographic level at which you want data, and they will provide you with a list of available data. A few notes about using these services:
The larger the geographic area of aggregation, the more data they will be able to provide – privacy concerns mean that when geographic areas get really small, some data may be withheld to protect respondents.
They provide data in three files – a shapefile with a column called GISJOIN, a tabular dataset with all your data and a GISJOIN column, and a README that tells you what all the poorly named variables in the tabular data mean. So your first step with this data will almost always be to merge the tabular data with the shapefile using GISJOIN, then rename things based on the data in the README.
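The merge-then-rename step described above can be sketched as follows. Everything here is a stand-in: the GISJOIN values, the variable code `AB12E001`, and the name `total_pop` are hypothetical, and with a real download you would load the two files with `gpd.read_file()` and `pd.read_csv()` rather than building them by hand.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# Hypothetical stand-in for the shapefile (normally: gpd.read_file("<your>.shp")).
shapes = gpd.GeoDataFrame(
    {"GISJOIN": ["G0100010", "G0100030"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
    crs="EPSG:4326",
)

# Hypothetical stand-in for the tabular file (normally: pd.read_csv("<your>.csv")).
table = pd.DataFrame(
    {"GISJOIN": ["G0100010", "G0100030"], "AB12E001": [100, 200]}
)

# Merge from the GeoDataFrame side so the result keeps its geometry and CRS.
# validate="1:1" raises an error if GISJOIN is duplicated in either input.
merged = shapes.merge(table, on="GISJOIN", how="left", validate="1:1")

# Rename cryptic variable codes using what the README tells you.
merged = merged.rename(columns={"AB12E001": "total_pop"})
print(merged)
```

Calling `.merge()` on the GeoDataFrame (rather than on the plain DataFrame) is a deliberate choice here, since it means the result is still a GeoDataFrame.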
Public Satellite Data¶
Another great spatial data resource is satellite data! We aren’t covering raster data in detail in this class, but that’s not because it isn’t useful – NASA has satellite data for the whole world with information on things like elevation, flood risk, air pollution, what kinds of plants are growing in different places (by looking at what wavelengths plants reflect, satellites can identify crops!), satellite imagery (used for things like studying energy infrastructure, or for “financial intelligence” firms doing things like studying factory activity to predict company earnings ahead of official announcements), and more. It’s… obscene how much data they have.
While most of this comes from NASA or NOAA, in the same way most people get their census data from IPUMS (not government census bureaus), most people I know actually get their satellite data from either the Microsoft Planetary Computer or the AWS Open Data Registry.
Environmental Data¶
While I think most environmental data is satellite data (see note above), the Microsoft Planetary Computer has some other fun resources (like this database of labelled images of wildlife!)
Other Public Data on AWS¶
The AWS Open Data Registry, while poorly organized, has great genomics and health data from the NIH, all the environmental data noted above, space telescope data, and more!
Library Collections¶
The Stanford Earthworks GIS Library is an amazing collection – not everything you find there will be accessible, but a lot will be, and if nothing else can point you to data providers. Other libraries also have some great resources (here’s Duke’s list), though I find the Stanford system to be especially searchable.
And if you want data you can’t find and are at a library, consider reaching out to a reference librarian – these are people whose job is to help people doing research find the information they need, and many libraries now have GIS specialists!
Got Money?¶
Selling data is a huge business these days, and there are a few groups that specialize in spatial data. I’ve never used any of them, but here are a few:
SafeGraph: Has lots of cell-phone tracking data that it uses to tell companies where potential customers are actually walking around.