Because vector spatial data includes geometric objects and projection metadata, it generally can’t easily be stored in normal tabular formats like CSVs. Instead, spatial data is generally stored in two main formats: shapefiles and GeoJSON files. In this reading, we’ll talk briefly about these formats, and where to find spatial data!
Spatial Data Formats¶
Shapefiles are a very old, very simple format, and are kind of like the CSVs of spatial data. They are probably the most commonly passed-around vector data format.
Shapefiles actually consist of several files, all with the same name but different suffixes. For example, a shapefile of data on counties might consist of counties.shp, counties.shx, etc. All shapefiles will contain a .shp file, but after that all bets are off on the number of files and the suffixes they will include.
To load a shapefile with geopandas, simply place all the shapefile files in a single folder and point the gpd.read_file() function at the .shp file. Geopandas will do the work of looking for other files with the same name in the same folder.
Similarly, you can write a GeoDataFrame (gdf) to a shapefile with the command gdf.to_file("my_shapefile.shp"). You’ll just find that more than one file has been created.
The one thing to be aware of about shapefiles is that they have some odd restrictions. For example, column names cannot be longer than 10 characters, so long column names will get truncated on write, and names with punctuation may be altered. And string values in your data cannot be longer than 254 characters.
An increasingly popular new format for spatial data is GeoJSON. Unlike a shapefile, a GeoJSON dataset is a single file with a .geojson file suffix. Geopandas can read GeoJSON files with gpd.read_file() (same function as above; geopandas will check the file suffix to determine if the file being read is a shapefile or GeoJSON), and write them with gdf.to_file(), the same method used for shapefiles, given a filename ending in .geojson and the argument driver="GeoJSON".
CSVs for Points¶
The one place where normal data formats may be used for spatial data is when dealing with points, since a point is fully specified by a single x-coordinate and a single y-coordinate. As a result, you may often find that point data comes to you in the form of a CSV. When you get this kind of data, the two coordinate columns of a normal dataset can be turned into a GeoDataFrame easily with the gpd.points_from_xy() function, something we’ll talk about in our readings on projections.
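Here is a minimal sketch of that conversion. The CSV contents, the column names "longitude" and "latitude", and the choice of EPSG:4326 as the coordinate reference system are all assumptions for illustration; your own data will tell you which columns and which CRS to use.

```python
import io

import geopandas as gpd
import pandas as pd

# Hypothetical CSV of point data; in practice you would pass a file path.
csv_text = """site,longitude,latitude
A,-78.5,36.0
B,-79.0,35.5
"""
df = pd.read_csv(io.StringIO(csv_text))

# x comes first (longitude), then y (latitude).
points = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
    crs="EPSG:4326",  # assumed: plain latitude/longitude coordinates
)

print(points)
```

The most common mistake here is swapping the two columns: points_from_xy() takes x (longitude) first and y (latitude) second.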
There are some other formats in the world, such as GeoPackages, which are also supported by geopandas. In general, though, I’ll admit I haven’t really seen them in the wild…
Finding Spatial Data¶
Below is some advice on finding spatial data for your own use. None of this is stuff you should memorize; rather, this is meant as a resource you can revisit if you need help in the future!
My most valuable advice for finding spatial data is: include “shapefile” or “geojson” in your query. Seriously – it makes a HUGE difference in terms of the likelihood you will actually find data and not just a site that talks about your subject!
Quick and Easy: DIVA-GIS¶
Want some country administrative borders (or administrative borders within a country), satellite images, elevation data, or other data? Check out DIVA-GIS. I don’t know that their data is always as trustworthy as data from official sources, but for quick-and-dirty analyses, it’s a great resource.
Government census data is often the underpinning of spatial analyses, because it’s available almost everywhere, is free, and has tons of information about… well, everyone!
The best resources for spatial census data are NHGIS (for US data) and IHGIS (for international data). These projects are run by the same folks, IPUMS, who we’ve gone to in the past for individual-level census data in the US or internationally. They’re amazing. You go to their site, tell them the geographic level at which you want data, and they will provide you with a list of available data. A few notes about using these services:
The larger the geographic area of aggregation, the more data they will be able to provide – privacy concerns mean that when geographic areas get really small, some data may be withheld to protect respondents.
They provide data in three files: a shapefile with a column called GISJOIN, a tabular dataset with all your data and a GISJOIN column, and a README that tells you what all the poorly named variables in the tabular data mean. So your first step with this data will almost always be to merge the tabular data with the shapefile using GISJOIN, then rename things based on the data in the README.
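The merge-then-rename step described above might look something like the sketch below. Everything except the GISJOIN column name is made up: the GISJOIN values, the variable code "AB1001", and the toy square "counties" stand in for what you would load from the real shapefile, table, and README.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Stand-in for the shapefile (normally loaded with gpd.read_file).
shapes = gpd.GeoDataFrame(
    {"GISJOIN": ["G001", "G002"]},  # hypothetical identifiers
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Stand-in for the tabular file (normally loaded with pd.read_csv).
table = pd.DataFrame(
    {"GISJOIN": ["G001", "G002"], "AB1001": [1200, 3400]}  # hypothetical code
)

# Keep the GeoDataFrame on the left so the result stays a GeoDataFrame;
# validate="1:1" raises an error if GISJOIN is duplicated on either side.
merged = shapes.merge(table, on="GISJOIN", how="left", validate="1:1")

# Rename cryptic codes using the meanings listed in the README.
merged = merged.rename(columns={"AB1001": "total_pop"})

print(merged)
```

Putting the GeoDataFrame on the left side of the merge matters: merging in the other direction returns an ordinary DataFrame that has lost its spatial behavior.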
The other great resource is NASA. We aren’t covering raster data in detail in this class, but that’s not because it isn’t useful – NASA has satellite data for the whole world with information on things like flood risk, elevations, air pollution, what kinds of plants are growing in different places (by looking at what wavelengths plants reflect, satellites can identify crops!) and more. It’s… obscene how much data they have.
The Stanford Earthworks GIS Library is an amazing collection – not everything you find there will be accessible, but a lot will be, and if nothing else can point you to data providers. Other libraries also have some great resources (here’s Duke’s list), though I find the Stanford system to be especially searchable.
And if you want data you can’t find and are at a library, consider reaching out to a reference librarian. These are people whose job is to help people doing research find the information they need, and many libraries now have GIS specialists!
Selling data is a huge business these days, and there are a few groups that specialize in spatial data. I’ve never used any of them, but here are a few: