U.S. National Parks PostGIS Project

In this project I use Python and PostGIS to find nearby national parks.¶

After finding a website with the information I need, I use Python and BeautifulSoup to scrape the data.¶

In [ ]:
from bs4 import BeautifulSoup
import requests 

page = requests.get("https://www.latlong.net/category/national-parks-236-42.html")

soup = BeautifulSoup(page.content,'html.parser')

table = soup.find('table')

The raw data is not cleanly formatted, so I clean it and then create a csv file of the cleaned data which I will use with the PostgreSQL database and PostGIS.¶

In [ ]:
rows = []
name_list = []
for i, row in enumerate(table.find_all('tr')):
    if i == 0:
        pass
    else:
        rows.append([el.text.strip() for el in row.find_all('td')])

def get_state_name(qs):
    state_dict = {
                    'ak': 'alaska', 'al': 'alabama','ar': 'arkansas', 'az': 'arizona','ca': 'california','co': 'colorado','ct': 'connecticut','de': 'delaware',
                    'fl': 'florida','ga': 'georgia','hi': 'hawaii','ia': 'iowa','id': 'idaho','il': 'illinois','ks': 'kansas',
                    'ky': 'kentucky','la': 'louisiana','ma': 'massachusetts','md': 'maryland','me': 'maine','mi': 'michigan',
                    'mn': 'minnesota','mo': 'missouri','ms': 'mississippi','mt': 'montana','nc': 'north carolina','nd': 'north dakota','ne': 'nebraska',
                    'nh': 'new hampshire','nj': 'new jersey','nm': 'new mexico','nv': 'nevada',
                    'ny': 'new york','oh': 'ohio','ok': 'oklahoma','or': 'oregon','pa': 'pennsylvania','ri': 'rhode island','sc': 'south carolina',
                    'sd': 'south dakota','tn': 'tennessee','tx': 'texas','ut': 'utah','va': 'virginia','vt': 'vermont',
                    'wa': 'washington','wi': 'wisconsin','wv': 'west virginia','wy': 'wyoming'
                        }
    for the_key in state_dict.keys():
        if the_key in qs.split():
            return state_dict[the_key]
        


with open("national_parks.txt","a",encoding = 'utf-8',errors='replace') as f:
    park_id = 100
    for row in rows:
        park = row[0]
        name = park.split(",")[0]
        if len(park.split(",")) == 3:
            location = park.split(",")[1]
        elif len(park.split(",")) == 4:
            location = park.split(",")[2]
        else:
            pass
        if location in name_list:
            continue
        else:
            name_list.append(location)
        lat = row[1]
        long = row[2]

        if len(location.lstrip(' ')) == 2:
            location = location.lower()
            location = get_state_name(location).title()

        the_line = f"{park_id},'{name}','{location.lstrip(' ')}',{row[1]},{row[2]}\n"
        f.write(the_line)
        park_id = park_id + 10

Now in PostgreSQL, after adding the postgis extension if needed, I create a table that will store the cleaned national parks data.¶

In [ ]:
CREATE EXTENSION postgis;

CREATE TABLE national_parks(
   park_id smallint PRIMARY KEY,
   park_name text NOT NULL,
   park_location text NOT NULL,
   park_latitude numeric(9,6),
   park_longitude numeric(9,6));

Load the data from the file into the table.¶

In [ ]:
COPY national_parks
FROM 'C:\Users\hunte\Desktop\national_parks.txt'
WITH (FORMAT CSV, QUOTE  "'");

Now that the data is loaded, I alter the table to create a new spatail datatype column.¶

In [ ]:
ALTER TABLE national_parks ADD COLUMN geog_point geography(POINT,4326);

UPDATE national_parks 
   SET geog_point =
   ST_SetSRID(
   ST_MakePoint(park_longitude,park_latitude)::geography,4326);

Although not strictly necessary for a table of this size, I next create an index to speed up searches.¶

In [ ]:
CREATE INDEX national_parks_idx ON national_parks USING GIST (geog_point);

View a few rows to check that everything is inserted into the table as expected.¶

In [ ]:
SELECT park_name, park_latitude, park_longitude, geog_point,
  ST_AsEWKT(geog_point)
  FROM national_parks
  LIMIT 3;

Find parks within 450 kms (about 280 miles) of Denver, CO. (The Denver coordinates were obtained from a Google search.)¶

In [ ]:
SELECT *
  FROM national_parks
  WHERE ST_DWithin(geog_point,
 ST_GeogFromText('POINT(-104.991531 39.742043)'),
 450000)
  ORDER BY park_name;

Now I want to see how far away those are in miles.¶

In [ ]:
SELECT park_name,
  park_location,
  round(
    (ST_Distance(geog_point,
    ST_GeogFromText('POINT(-104.991531 39.742043)')
    ) / 1609.344)::numeric,2
  ) AS miles_from_dt
  FROM national_parks
  WHERE ST_DWithin(geog_point,
    ST_GeogFromText('POINT(-104.991531 39.742043)'),
    450000)
  ORDER BY miles_from_dt ASC;

Finally, I use k-nearest neighbors to get PostGIS to find the 5 nearest parks to Denver.¶

In [ ]:
SELECT park_name,
  park_location,
  round(
    (ST_Distance(geog_point,
    ST_GeogFromText('POINT(-104.991531 39.742043)')
    ) / 1609.344)::numeric,2
  ) AS miles_from_dt
  FROM national_parks
  ORDER BY geog_point <-> ST_GeogFromText('POINT(-104.991531 39.742043)')
  LIMIT 5;