Locating a New Grocery Store

Congratulations! You’ve been hired by Whole Foods to help them identify potential locations for a new store in New York.

To aid in your analysis, we will be working with a dataset of all retail food stores in New York provided by the NY State Department of Agriculture and Markets. The source for the data is here, but please work from the copy of the data here. We will use this data to identify areas that are under-served by grocery stores today (i.e. to find market opportunities).

Of course, it’s not enough to know where grocery stores currently are or are not. If it were, we could just advise our client to put some grocery stores out in the middle of nowhere! So we’ll also be using some demographic data from the US American Community Survey (downloaded from NHGIS). You can find that data here. This is all public data, and as such is not nearly as detailed as data you can buy from marketing professionals, but even so you’ll see we can get a lot of value from it.

Exercise 1

Please download, import, and plot both datasets.

Exercise 2

Let’s begin by establishing where our CURRENT Whole Foods stores are located. Please create a new GeoDataFrame consisting of all Whole Foods locations in New York. How many Whole Foods are there currently in New York? Where are they predominantly located?

If you have questions about the variables in the NY retail food database, you can find all the documentation provided by the NY State Department of Agriculture and Markets in this folder.

Exercise 3

Now let’s create a GeoDataFrame with competing grocery stores. In the NY area, we’ll limit attention to Walmart (hint: there are a lot of Walmarts), ACME Markets (Albertsons), Trader Joe’s, ALDI, and any stores identifying themselves as supermarkets or super markets. “Grocery”, as a search term, catches a lot of delis and bodegas, which aren’t really in competition with Whole Foods.

(Note that in a real analysis, you’d want to be a little more careful to include any other non-chain grocery stores!)
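
To make the name matching concrete, here is a minimal sketch on made-up store names. The column name store_name and the exact search pattern are assumptions for illustration only; check the documentation for the real column names and look at the actual spellings in the data before settling on a pattern.

```python
import pandas as pd

# Toy stand-in for the retail food store table.
# "store_name" is a hypothetical column name, and these rows are invented.
stores = pd.DataFrame({"store_name": [
    "WAL-MART SUPERCENTER #1234",
    "TRADER JOE'S #545",
    "JOE'S DELI & GROCERY",
    "C-TOWN SUPERMARKET",
    "ALDI 72",
    "BEST SUPER MARKET",
    "WHOLE FOODS MARKET",
]})

# Case-insensitive regular expression covering the chains listed above plus
# "supermarket" / "super market". \b keeps "aldi" from matching inside longer words.
pattern = r"wal-?\s?mart|acme|trader joe|\baldi\b|super\s?market"
is_competitor = stores["store_name"].str.contains(
    pattern, case=False, regex=True, na=False
)

competitors = stores[is_competitor]
print(competitors)
```

The same boolean-mask approach works unchanged on a GeoDataFrame, since it only looks at a text column.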

Exercise 4

Now comes the fun part!

In the census data we loaded previously, we have information on the population, average income, and average education for every census block group (a “block group” is an official level of aggregation) in New York. But we don’t know how close those people are to an existing grocery store! So let’s use the data we’ve now created to gather information about what communities may currently be under-served by local grocery stores!

The first thing we want to do is rule out locations near existing Whole Foods.

Using sjoin_nearest, find the closest Whole Foods to every census block group. Also use the distance_col keyword argument (you can use sjoin_nearest? to read about it) to get the actual distance to said nearest Whole Foods.

Note that the distance you get from this operation will be in meters because of how our data is being represented (its projection, something we’ll read about for our second GIS class).

Also note that we’re getting the distance to the nearest Whole Foods from each census block group, which are polygons. geopandas is smart enough to calculate the shortest distance from the polygon (i.e. the distance from the point on the polygon edge that is closest to the Whole Foods!).
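
If the behavior of sjoin_nearest is hard to picture, the following small sketch on toy geometries may help. The squares, points, column names (bg_id, store, dist_to_wf), and the choice of EPSG:32618 as a meter-based projection are all assumptions made up for this illustration, not the real data.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two toy "block groups" (1 km squares) in a projected CRS measured in meters.
# EPSG:32618 (UTM zone 18N) is an assumption chosen only for illustration.
block_groups = gpd.GeoDataFrame(
    {"bg_id": ["A", "B"]},
    geometry=[box(0, 0, 1000, 1000), box(5000, 0, 6000, 1000)],
    crs="EPSG:32618",
)

# Two toy stores: one inside block group A, one well east of block group B.
whole_foods = gpd.GeoDataFrame(
    {"store": ["WF 1", "WF 2"]},
    geometry=[Point(500, 500), Point(9000, 500)],
    crs="EPSG:32618",
)

# For each block group, attach the nearest store and record the distance
# (in CRS units, here meters) in a new column.
nearest = gpd.sjoin_nearest(
    block_groups, whole_foods, how="left", distance_col="dist_to_wf"
)
print(nearest[["bg_id", "store", "dist_to_wf"]])
```

Block group A contains a store, so its distance is 0. Block group B is 3,000 meters from its nearest store, measured from the closest edge of the polygon rather than from its center.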

Exercise 5

Let’s start building a dataset of possible grocery locations. First, let’s drop any census block groups that aren’t urban (Whole Foods is really an urban company). To do so, calculate the population density of each census block group and drop any census block groups with population densities below 100 people per square kilometer (bearing in mind that the units of this map are meters, so .area will return an area in square meters).
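
The unit conversion is the easy part to get wrong, so here is a minimal sketch on toy data. The column names population and pop_density are hypothetical, and the squares are invented; only the arithmetic (1 square kilometer = 1,000,000 square meters) carries over to the real data.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy block groups in a meter-based CRS (EPSG:32618 is an assumption).
# "population" is a hypothetical column name.
block_groups = gpd.GeoDataFrame(
    {"bg_id": ["dense", "sparse"], "population": [1500, 200]},
    geometry=[box(0, 0, 2000, 2000), box(0, 0, 5000, 5000)],
    crs="EPSG:32618",
)

# .area is in square meters here; divide by 1e6 to get square kilometers.
area_km2 = block_groups.geometry.area / 1e6
block_groups["pop_density"] = block_groups["population"] / area_km2

# Keep only block groups with at least 100 people per square kilometer.
urban = block_groups[block_groups["pop_density"] >= 100]
print(urban[["bg_id", "pop_density"]])
```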

Exercise 6

Now drop any block groups that are already less than 8 km from a Whole Foods.

Exercise 7

Now, for each remaining block group, calculate the distance to the nearest non-Whole Foods competitor (the stores you collected in Exercise 3). What does the distribution of those distances look like?

NOTE: When geopandas finds multiple observations at the same distance, it will keep them all. That can cause problems here: because there may be MULTIPLE competitors in the same block group, you can end up with block groups being duplicated. Since we only care about the distance to the nearest competitor (not whether there are multiple at distance 0), drop these duplicates. Your observations should be unique on ["STATEFP", "COUNTYFP", "TRACTCE", "BLKGRPCE"].
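
Here is a small sketch of how that duplication arises and one way to remove it. The single block group, the two stores, and the column name dist_to_competitor are invented for illustration; drop_duplicates is just one reasonable way to collapse the ties.

```python
import geopandas as gpd
from shapely.geometry import Point, box

id_cols = ["STATEFP", "COUNTYFP", "TRACTCE", "BLKGRPCE"]

# One toy block group (invented identifier values) in a meter-based CRS.
block_groups = gpd.GeoDataFrame(
    {"STATEFP": ["36"], "COUNTYFP": ["001"], "TRACTCE": ["000100"], "BLKGRPCE": ["1"]},
    geometry=[box(0, 0, 1000, 1000)],
    crs="EPSG:32618",
)

# Two toy competitors that both sit INSIDE the block group, so both are at distance 0.
competitors = gpd.GeoDataFrame(
    {"store": ["X", "Y"]},
    geometry=[Point(100, 100), Point(900, 900)],
    crs="EPSG:32618",
)

joined = gpd.sjoin_nearest(
    block_groups, competitors, how="left", distance_col="dist_to_competitor"
)
print(len(joined))   # the tie produces one row per equally-near store

# Keep a single row per block group; the distance is the same on every tied row.
deduped = joined.drop_duplicates(subset=id_cols)
print(len(deduped))
```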

Exercise 8

We’re now close to identifying places that are urban, not too close to an existing Whole Foods, and not too close to an existing competitor! We’re almost there!

But before we filter on distance to the nearest competitor, let’s also think about what a Whole Foods customer looks like. If this were a real consulting gig, we could ask Whole Foods for data on their current customers, but for the moment let’s just assert that they tend to be wealthy and highly educated.

Our census data already has a variable with median household income (md_hh_inc). Now also construct a variable that gives us the share of people over 25 in each block group who have at least a bachelor’s degree. This will entail using the variables that start with ALWG along with information from the codebook we referenced before.
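
The calculation itself is a sum of the “bachelor’s or higher” counts divided by the total count. The sketch below shows that shape on invented numbers. Every column name in it is a hypothetical placeholder: the real ALWG variable names, and which of them correspond to which degree, have to come from the codebook.

```python
import pandas as pd

# Invented counts for two block groups. All column names are placeholders,
# NOT the real ALWG variable names from the codebook.
df = pd.DataFrame({
    "edu_total": [1000, 400],         # everyone over 25
    "edu_bachelors": [200, 40],
    "edu_masters": [100, 10],
    "edu_professional": [30, 0],
    "edu_doctorate": [20, 0],
})

# Columns that count people with at least a bachelor's degree.
degree_cols = ["edu_bachelors", "edu_masters", "edu_professional", "edu_doctorate"]

# Row-wise sum of degree holders divided by the over-25 total.
df["share_college"] = df[degree_cols].sum(axis=1) / df["edu_total"]
print(df["share_college"])
```

In real data, watch for block groups where the total is zero, since the division will then produce missing or infinite values.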

Exercise 9

Great! At this point, we have four variables we’re selecting on: share with college, median household income, distance to nearest competitor, and distance to nearest Whole Foods. There’s no obvious way to balance these considerations, but for the moment let’s look for places where the share of people with college degrees is over 50%, median household income is over $90,000, the nearest competitor’s grocery store is more than 8km away, and the nearest Whole Foods is over 8km away.

Using these filters, can you identify any counties that seem like especially good candidates (i.e. counties with a fair number of people living in census block groups that fit our criteria)?
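
One way to combine several filters and then summarize by county is sketched below on invented rows. Apart from md_hh_inc and COUNTYFP, which appear in this assignment, the column names (population, share_college, dist_to_competitor, dist_to_wf) are assumptions, and summing population by county is only one possible way to rank candidates.

```python
import pandas as pd

# Invented block-group rows; distances are in meters.
bgs = pd.DataFrame({
    "COUNTYFP": ["001", "001", "003", "003"],
    "population": [1200, 800, 1500, 900],
    "share_college": [0.60, 0.55, 0.40, 0.70],
    "md_hh_inc": [95_000, 120_000, 150_000, 91_000],
    "dist_to_competitor": [9_000, 12_000, 10_000, 3_000],
    "dist_to_wf": [20_000, 15_000, 30_000, 10_000],
})

# Combine the four conditions with & (each condition needs its own parentheses).
keep = (
    (bgs["share_college"] > 0.5)
    & (bgs["md_hh_inc"] > 90_000)
    & (bgs["dist_to_competitor"] > 8_000)
    & (bgs["dist_to_wf"] > 8_000)
)
candidates = bgs[keep]

# Total population living in qualifying block groups, by county.
pop_by_county = (
    candidates.groupby("COUNTYFP")["population"].sum().sort_values(ascending=False)
)
print(pop_by_county)
```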

Exercise 10

Great work! We’ve found some candidates!

However, it’s worth noting that we have been a little crude here in a couple ways. First, obviously, in the real world we’d want more information on what demographic features best predict someone being a Whole Foods customer so we can put some weights on these various filters we’re applying.

But looking at distance to the nearest grocery store is also a bit of a crude approach to understanding potential customers’ grocery access. So let’s take a new approach!

Rather than measuring the distance to the nearest grocery store, let’s look at the number of grocery stores within 15km of each block group.

To do this, we’ll begin by using buffer() to expand each block group outward by the buffered distance. Then we can do a spatial join of these expanded polygons with our “other grocer” dataset.

To begin, buffer your block groups by 15 kilometers and set the new buffered polygons as your GeoDataFrame’s “official geometry”.
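
As a minimal sketch of the mechanics, the toy example below buffers one square and makes the buffered column the active geometry. The column name buffered and the EPSG:32618 projection are assumptions for illustration; because the CRS is in meters, 15 kilometers is written as 15,000.

```python
import geopandas as gpd
from shapely.geometry import box

# One toy block group (a 1 km square) in a meter-based CRS (an assumption).
block_groups = gpd.GeoDataFrame(
    {"bg_id": ["A"]},
    geometry=[box(0, 0, 1000, 1000)],
    crs="EPSG:32618",
)

# buffer() takes a distance in CRS units, so 15 km = 15_000 meters here.
block_groups["buffered"] = block_groups.geometry.buffer(15_000)

# Make the buffered polygons the active geometry; spatial operations such as
# sjoin will now use them. The original polygons stay in the "geometry" column.
buffered = block_groups.set_geometry("buffered")

print(buffered.geometry.name)
print(buffered.total_bounds)
```

Keeping the original polygons in their own column makes it easy to switch back later with another call to set_geometry.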

Exercise 11

Now spatially join (sjoin) your buffered polygons with your GeoDataFrame of other grocers.
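
The sketch below shows what such a spatial join produces on toy data. The three squares, three stores, and column names are invented, and how="left" is one possible choice: it keeps block groups that have no grocers within the buffer, which are arguably the ones we care about most.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Three toy block groups in a meter-based CRS (EPSG:32618 is an assumption).
block_groups = gpd.GeoDataFrame(
    {"bg_id": ["A", "B", "C"]},
    geometry=[
        box(0, 0, 1000, 1000),
        box(50_000, 0, 51_000, 1000),
        box(200_000, 0, 201_000, 1000),
    ],
    crs="EPSG:32618",
)

# Replace each polygon with its 15 km buffer.
buffered = block_groups.copy()
buffered["geometry"] = block_groups.geometry.buffer(15_000)

# Three toy grocers: two near A, one near B, none near C.
grocers = gpd.GeoDataFrame(
    {"store": ["X", "Y", "Z"]},
    geometry=[Point(5_000, 500), Point(10_000, 500), Point(40_000, 500)],
    crs="EPSG:32618",
)

# One output row per (block group, grocer inside its buffer) pair.
# With how="left", a block group with no grocers keeps one row with missing store.
joined = gpd.sjoin(buffered, grocers, how="left", predicate="intersects")
print(joined[["bg_id", "store"]])
```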

Exercise 12

Now we have a dataset where we have one row for every (block-group) x (grocer within 15km) combination, so the next thing we have to do is collapse our data back down to the block-group level. So let’s groupby our data to get the COUNT of competitors within 15km.

Bear in mind that each block group is uniquely identified by "STATEFP", "COUNTYFP", "TRACTCE", "BLKGRPCE". Also note you won’t be able to groupby with your geometry columns in there, so you’ll need to do a groupby with a subset of columns, then merge your results back in.
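
A minimal sketch of the count-then-merge pattern follows. The identifier values, the store column, and the n_competitors name are invented, and the pairs table stands in for the non-geometry columns of your joined data. Filling missing counts with 0 assumes that a block group absent from the pairs table has no competitors nearby.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

id_cols = ["STATEFP", "COUNTYFP", "TRACTCE", "BLKGRPCE"]

# Three toy block groups with invented identifiers.
block_groups = gpd.GeoDataFrame(
    {
        "STATEFP": ["36", "36", "36"],
        "COUNTYFP": ["001", "001", "003"],
        "TRACTCE": ["000100", "000100", "000200"],
        "BLKGRPCE": ["1", "2", "1"],
    },
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(5, 0, 6, 1)],
    crs="EPSG:32618",
)

# Stand-in for the joined data WITHOUT geometry: one row per
# (block group, nearby grocer) pair. The third block group has no pairs.
pairs = pd.DataFrame({
    "STATEFP": ["36", "36", "36"],
    "COUNTYFP": ["001", "001", "001"],
    "TRACTCE": ["000100", "000100", "000100"],
    "BLKGRPCE": ["1", "1", "2"],
    "store": ["X", "Y", "Z"],
})

# Collapse to one row per block group with the number of nearby grocers.
counts = pairs.groupby(id_cols)["store"].count().reset_index(name="n_competitors")

# Merge the counts back onto the GeoDataFrame; unmatched block groups get 0.
merged = block_groups.merge(counts, on=id_cols, how="left")
merged["n_competitors"] = merged["n_competitors"].fillna(0).astype(int)
print(merged[id_cols + ["n_competitors"]])
```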

Exercise 13

Now let’s try using filters like the ones we used before, but this time subsetting for census block groups with fewer than three competitors within 15km and no competitors within 5km. That is, keep block groups where:


  • the share of people with college degrees is over 50%,

  • median household income is over $90,000,

  • nearest competitor’s grocery store is more than 5km away,

  • the nearest Whole Foods is over 5km away,

  • fewer than 3 competitors within 15km.

What counties seem like the most plausible new locations now?