Documentation

FastOpenData

We make it easy to improve your analytics capabilities by leveraging open source data.

Note: We’re in closed alpha testing, so our servers are not working for requesting API keys yet!

Using FastOpenData

There are three steps for using the FastOpenData API to get information about addresses in the United States:

  1. Installing the client
  2. Getting a free API key
  3. Retrieving data

If you don’t want to use the official Python client, you can easily roll your own client or use tools such as curl or wget to make ad-hoc requests. There are examples of valid curl requests below. But we recommend at least starting with the Python client.


Installing the client

You don’t need to use the official Python client, but using the client is the easiest way to interact with the FastOpenData service. It can be installed using pip:

pip install fastopendata-client

If you don’t have pip, you can clone the GitHub repository for the client, which is located at:

https://github.com/zacernst/fastopendata_client

Getting a free API key

To use this client to retrieve data, you must have an API key. You can use the client itself to get a free API key that’s suitable for evaluation purposes and for use-cases that aren’t too demanding. After you’ve installed the client, the following code will retrieve a new key:

>>> from fastopendata_client.client import FastOpenData
>>> api_key = FastOpenData.get_api_key('YOUR_EMAIL_ADDRESS')

This will both print your API key to the terminal and assign it to the api_key variable.

Save your API key somewhere; you will use it each time you invoke the FastOpenData client. If you lose your key, you can call this method again with the same email address to get a new one.

The free API key is rate limited and not suitable for production purposes. To subscribe to the FastOpenData service and receive a key that will provide unlimited access, please visit https://fastopendata.com. If you have questions about the service or your particular use-case, email zac@fastopendata.com.


Retrieving data

The main class for the FastOpenData client is FastOpenData, which is instantiated like so:

>>> from fastopendata_client.client import FastOpenData
>>> session = FastOpenData(api_key="<YOUR_API_KEY>")

If you do not provide a value for the api_key parameter, the client will try to find an API key in the environment variable FASTOPENDATA_API_KEY. If it cannot find an API key, the client will raise an exception.
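The lookup order described above can be sketched as follows. This only illustrates the documented behavior and is not the client's actual code: the helper name resolve_api_key and the use of ValueError are assumptions, and only the environment variable name comes from this page.

```python
import os

ENV_VAR = "FASTOPENDATA_API_KEY"  # variable name taken from this page


def resolve_api_key(api_key=None, environ=None):
    """Return an explicit key if given, otherwise fall back to the environment.

    Hypothetical helper; the real client may raise a different exception type.
    """
    environ = os.environ if environ is None else environ
    key = api_key or environ.get(ENV_VAR)
    if not key:
        raise ValueError(f"No API key given and {ENV_VAR} is not set")
    return key
```

Keeping the key in the environment variable means it does not have to appear in scripts or notebooks that you share.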

Data is retrieved by calling the appropriate methods of session.

Requesting a single address

If you only want data for a single address, create a session with the FastOpenData class as above, and call the request method:

>>> data = session.request(free_form_query="123 Main Street, Tallahassee, FL, 12345")

Assuming that all is well, data now contains a dictionary with all the relevant data from the FastOpenData server.

Appending data to a Pandas dataframe

A common use-case is to have a Pandas dataframe, where each row contains information about a person or household, including an address. If you have such a dataframe, you can append new columns containing data from FastOpenData by calling FastOpenData.append_to_dataframe and providing the name(s) of column(s) which contain address information. For example, if COLUMN_NAME contains unstructured address strings, and your dataframe is df, you can do this:

>>> session.append_to_dataframe(df, free_form_query=COLUMN_NAME)

Now df contains many new columns containing data from the FastOpenData server.

If your dataframe has columns containing structured address information — that is, one column for the street address, another for the city, and so on — you can do:

>>> session.append_to_dataframe(
        df,
        address1=ADDRESS1_COLUMN,
        address2=ADDRESS2_COLUMN,
        city=CITY_COLUMN,
        state=STATE_COLUMN,
        zip_code=ZIP_CODE_COLUMN
    )

Note that you can specify either free_form_query or the column names for structured address data, but not both. Specifying both will raise an exception.
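If you wrap the client in your own code, it can help to check this rule before making a call. The following is a minimal sketch, assuming the parameter names shown above; the function address_arguments and the exception types it raises are hypothetical and are not part of the client.

```python
STRUCTURED_FIELDS = ("address1", "address2", "city", "state", "zip_code")


def address_arguments(free_form_query=None, **structured):
    """Check that exactly one addressing style is used and return it as a dict."""
    unknown = set(structured) - set(STRUCTURED_FIELDS)
    if unknown:
        raise TypeError(f"Unexpected arguments: {sorted(unknown)}")
    given = {k: v for k, v in structured.items() if v is not None}
    if free_form_query is not None and given:
        raise ValueError("Use free_form_query or structured fields, not both")
    if free_form_query is None and not given:
        raise ValueError("No address information was provided")
    if free_form_query is not None:
        return {"free_form_query": free_form_query}
    return given
```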

Generally, it is better to use structured address information as in the previous example. Match rates are a little higher, and responses are a bit quicker. But we have also found that unstructured address information works well enough for the vast majority of use-cases.

Appending data to a CSV

Yet another common use-case is to have a CSV file where each row contains an address. Frequently, we want to append new columns to the CSV file containing additional information about each address.

For this scenario, you need to specify the path to your CSV file, the path to the new CSV file, the names of column(s) containing address information, and any options that you want to send to the CSV reader or writer. It works analogously to how we appended to a dataframe above, with the difference that the client will write a new CSV file that contains all the columns from the original file, plus new ones from FastOpenData.

Suppose you have a CSV file residing at /data/csv_file.csv which is delimited by pipes and contains unstructured address information in the address column. To write a new CSV file that is comma-delimited to /data/new_csv_file.csv containing data from FastOpenData, you would first create a session as in the examples above. Then you would execute the following:

>>> session.append_to_csv(
        "/data/csv_file.csv",
        "/data/new_csv_file.csv", 
        free_form_query="address",
        reader_options={"delimiter": "|"},
        writer_options={"delimiter": ","}
    )

Now you should have /data/new_csv_file.csv which contains all the columns from /data/csv_file.csv, plus many new columns from FastOpenData.

The dictionaries reader_options and writer_options will be passed verbatim as parameters to csv.DictReader and csv.DictWriter, respectively. They are optional, of course, if you are happy with the defaults.
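To make the role of the two option dictionaries concrete, here is a rough sketch of the same read, enrich, and write pattern using only the standard csv module. It is not the client's implementation: append_columns and its lookup argument are hypothetical, with lookup standing in for a call to the service, and it assumes lookup returns the same keys for every row.

```python
import csv


def append_columns(in_path, out_path, lookup, reader_options=None, writer_options=None):
    """Copy a CSV file, adding the columns returned by lookup(row) to every row.

    lookup takes one row (a dict) and returns a dict of new column names
    to values. Returns the number of rows written.
    """
    reader_options = reader_options or {}
    writer_options = writer_options or {}
    with open(in_path, newline="") as src:
        # The reader options (for example the delimiter) go straight to DictReader.
        rows = [dict(row, **lookup(row)) for row in csv.DictReader(src, **reader_options)]
    if not rows:
        return 0  # nothing to write; no output file is created
    with open(out_path, "w", newline="") as dst:
        # The writer options go straight to DictWriter.
        writer = csv.DictWriter(dst, fieldnames=list(rows[0]), **writer_options)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Because the dictionaries are unpacked as keyword arguments, their keys must be strings such as "delimiter" or "quotechar".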


Rolling your own client

If you like, you can download the OpenAPI specification for the service from the top of this page and use an automated tool to generate client code. But the following brief list should give you enough information to write your own client, or send ad-hoc requests with tools such as curl, wget, or Postman:

  • Your client should send GET requests to https://fastopendata.com:8000.
  • The route for retrieving information about a single address is /.
  • Your API key should be provided in the headers of your request, under the header x-api-key.
  • Address information is to be given as query parameters. For unstructured address information, use the parameter free_form_query. For structured address information, use:
    • address1
    • address2
    • city
    • state
    • zip_code
  • All responses from the server are JSON blobs, so you should also include the header Content-Type: application/json.
  • The route for getting an API key is /get_free_api_key. It will send back a JSON blob with the key/value pair api_key: YOUR_API_KEY.
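As an illustration of the list above, the following sketch builds such a request with Python's standard library. The function name build_request is hypothetical; the base URL, header names, and query parameters are the ones listed above. The request is only constructed here, not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://fastopendata.com:8000"  # from the list above


def build_request(api_key, **address):
    """Build (but do not send) a GET request for a single address.

    Pass either free_form_query=... or the structured fields
    (address1, address2, city, state, zip_code) as keyword arguments.
    """
    params = {k: v for k, v in address.items() if v is not None}
    query = urlencode(params, quote_via=quote)  # encode spaces as %20
    return Request(
        f"{BASE_URL}/?{query}",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="GET",
    )
```

To send the request, pass the returned object to urllib.request.urlopen and decode the response body with json.load.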

Putting it all together, here is an example of a curl request that uses a free-form query to retrieve information about the worldwide headquarters for FastOpenData:

curl -X GET "https://fastopendata.com:8000/?free_form_query=1984%20Lower%20Hawthorne%20Trail" \
    -H "x-api-key: <YOUR_API_KEY>" \
    -H "Content-Type: application/json"

Be sure to replace <YOUR_API_KEY> with the API key you received from the client (see above).

To get your own free API key for the service, use the following curl command:

curl -X GET "https://fastopendata.com:8000/get_free_api_key?email_address=<YOUR_EMAIL_ADDRESS>" \
    -H "Content-Type: application/json"

You will get back a JSON blob that looks like this:

{"api_key":"XXXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"}

Include the API key with every request that you make.
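If you are scripting this step, the key can be read out of the response body with the standard json module. This is a small sketch; parse_api_key is a hypothetical helper, and the only thing it relies on is the api_key field shown above.

```python
import json


def parse_api_key(body):
    """Extract the key from the JSON body returned by /get_free_api_key."""
    payload = json.loads(body)
    key = payload.get("api_key")
    if not key:
        raise ValueError("Response did not contain an api_key field")
    return key
```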


Understanding the data

Reading the responses from FastOpenData

After you send any United States address in a request to FastOpenData, you will receive a JSON response containing a large number of data points for that address. From the Python client, the data will be in a dictionary.

When you send an address to the FastOpenData API, a few things happen:

  1. The address is searched and normalized. E.g. “123 Main #1” might become “123 Main Street Apt. 1”.
  2. The normalized address is geolocated to find its latitude and longitude.
  3. Using the latitude and longitude, the service identifies all the geographic areas that contain the address, such as census tract, school district, congressional district, and so on.
  4. All the data points that are available for each geographic area are returned in a JSON blob or dictionary.

The service uses a number of different geographic area types:

  1. Census block group
  2. Census tract
  3. County
  4. Public Use Microdata Area (PUMA)
  5. Congressional district
  6. School district
  7. Core Based Statistical Area (CBSA)

Additionally, some of these geographies are taken from different years because the source data does not consistently use boundaries from any specific time — some quite recent data uses geographic boundaries that are several years old. Indeed, this is one of the many challenges associated with collecting and standardizing these data sets.

The data is aggregated to each of these geographic areas, depending on the data source and how it was collected. If you use the FastOpenData.request method (as in the first code example above), you will receive a dictionary. The top-level keys of the dictionary specify a geographic boundary such as tract, county, etc. Under each of these keys is a number of data points as key-value pairs. For example, this is (part of) the JSON response to a request for information about an address in a small town in rural Georgia:

{
"cbsa_2013": {
    "geoid": null,
    "homes_owned_or_bought_by_member_of_household_count": null,
    "homes_rented_count": null,
    ...},
"census_block_group_2019": {
    "households_owning_one_automobile_count": 146,
    "households_owning_one_automobile_percent": 0.0305623471883,
    "households_owning_zero_automobiles_count": 25,
    "housing_units_count": 1068,
    "geoid": "131319506001",
    ...},
"county": {
    "arthritis_percent": 0.259,
    "asthma_percent": 0.105,
    "binge_drinking_percent": 0.131,
    "cancer_percent": 0.057,
    "cervical_cancer_screening_percent": 0.809,
    "cholesterol_screen_past_year_percent": 0.866,
    "chronic_kidney_disease_percent": 0.034,
    "chronic_obstructive_pulmonary_disease_percent": 0.097,
    "geoid": "13131",
    "bridges_count": 104,
    "business_establishment_count": 414,
    ...},
...
}

As you can see, the top-level keys are geographic areas, namely, the core based statistical area from 2013, the census block group of 2019, and the county. Like many small towns, this particular address does not have a core based statistical area, so the values under that key are all null.
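When processing many addresses, it can be handy to detect which geographies came back entirely null. The helper below is a sketch that assumes the response shape shown above (a dictionary of dictionaries); its name, empty_geographies, is made up for this example.

```python
def empty_geographies(response):
    """Return the top-level keys whose data points are all null (None in Python)."""
    return [
        geography
        for geography, points in response.items()
        if all(value is None for value in points.values())
    ]
```

For a response like the one above, the returned list would contain cbsa_2013 but not county.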

If you use the FastOpenData.append_to_dataframe method, the columns that are appended to the dataframe will have names of the form geography.data_point, as in county.bridges_count.
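The same naming scheme is easy to reproduce if you are working with raw responses. Here is a minimal sketch, assuming the nested dictionary shape shown above; flatten_response is a hypothetical helper, not a method of the client.

```python
def flatten_response(response):
    """Flatten {geography: {data_point: value}} into {"geography.data_point": value}."""
    return {
        f"{geography}.{name}": value
        for geography, points in response.items()
        for name, value in points.items()
    }
```

The flattened dictionary has one entry per data point, which maps directly onto one row of a table.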

There are many data points, and the list of data points is growing. We have tried to keep to a consistent set of naming conventions, and we have opted to use longer, more descriptive names for the data points rather than shorter, terse names. We have held to the following naming conventions:

  • The names of data points that give a percent value end in _percent. Values are in the range from zero to one.
  • Data points that give a count of something (as in the number of residents or the number of businesses) have names ending in _count.
  • Every geographic area has a unique identifier which is always provided under the key geoid.
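These conventions make it possible to handle data points generically by looking only at their names. The sketch below groups the data points of one geography by suffix and converts a fractional percent value for display; both function names are invented for this example.

```python
def split_by_convention(points):
    """Group one geography's data points by the suffix of their names."""
    groups = {"percent": {}, "count": {}, "other": {}}
    for name, value in points.items():
        if name.endswith("_percent"):
            groups["percent"][name] = value
        elif name.endswith("_count"):
            groups["count"][name] = value
        else:
            groups["other"][name] = value  # for example, geoid
    return groups


def as_display_percent(value):
    """Convert a zero-to-one fraction to a 0-100 percentage, keeping null as None."""
    return None if value is None else round(value * 100, 1)
```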

Data sources

FastOpenData collects data from a variety of open sources, including the United States government as well as several different open data projects. These sources include:

  • The Census Bureau
  • The Office of Management and Budget
  • OpenStreetMap
  • Wikidata
  • The Department of Transportation
  • The Department of Education
  • USASpending.gov
  • The Federal Election Commission

It is never necessary to specify the source(s) for the data you request. All available data from all sources is automatically returned for every request made to the API. Where a data point is not available, the API will return a value of null. The structure and fields in the API’s responses will always be the same.


Personally identifiable information

The data contains no personally identifiable information (PII), and all the data is available under a permissive license that allows you to use it for any legal purpose, including business purposes. Much of the data comes from surveys conducted by various government entities. This survey data is always aggregated to a geographic area large enough to defeat any attempts to de-anonymize the data. The addresses in your requests are never stored by FastOpenData in logs or any other place, and data is encrypted in transit because the server uses HTTPS.


Understanding geographic areas

It would be nice if there were a single type of geographic area we could use to aggregate all the data from each source. Unfortunately, this is not even close to being the case.

The most important types of geographic areas to understand are those which are used by the United States Census Bureau. The Census Bureau utilizes a hierarchy of geography types which have varying relationships to each other. It’s easiest to consider these geographies from smallest to largest.

The smallest geography that the Census Bureau uses is the census block. Think of these as the “atoms” of the Census Bureau, out of which all the other geographies are composed. Census blocks are small areas that are bordered by something visible — a road, stream, utility line, and so on. These units are so small that the Census Bureau doesn’t release very much information about their populations because there could be so few households that it might be possible to de-anonymize the data. So you won’t see census blocks referenced in the FastOpenData results.

A number of census blocks are bound together to form census block groups. Some data is available about census block groups, but not much. Again, this is owing to the possibility of de-anonymization.

Census tracts are composed of census block groups; they are used as building blocks for other geographic units. They do not, for example, cross county lines — that is, each census tract is either entirely contained within a county or entirely outside the county. FastOpenData has information that has been aggregated to census tracts, including information about languages spoken, certain demographic information, and some federal spending data.

Next we have public use microdata areas, which we usually refer to as PUMAs. These are statistical areas that are designed to preserve confidentiality when microdata (i.e. surveys of households and individuals) are collected. There are several strict criteria for defining PUMAs. First, they must encompass a population of at least 100,000. Second, they are composed of census tracts. Third, they cannot cross state boundaries. Fourth, when they cross a county boundary, each county-PUMA part must contain at least 10,000 people. That is, we cannot have a PUMA that intersects a county in an area with fewer than 10,000 people. And fifth, when a county has more than 200,000 people, it must be divided into at least two PUMAs.
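Three of these criteria are simple enough to express as checks. The following toy sketch is not how the Census Bureau delineates PUMAs; the function name and the input format (a mapping from county name to the population of the part of that county inside the PUMA, plus the set of states it touches) are assumptions made for illustration.

```python
def puma_violations(county_parts, states):
    """List which of the population and boundary rules a proposed PUMA breaks."""
    problems = []
    # First criterion: at least 100,000 people in total.
    if sum(county_parts.values()) < 100_000:
        problems.append("population below 100,000")
    # Third criterion: a PUMA cannot cross state boundaries.
    if len(states) > 1:
        problems.append("crosses a state boundary")
    # Fourth criterion: if several counties are involved, each part needs 10,000 people.
    if len(county_parts) > 1:
        for county, population in county_parts.items():
            if population < 10_000:
                problems.append(f"part in {county} has fewer than 10,000 people")
    return problems
```

An empty list means the proposal passes these three checks; the remaining criteria concern tracts and whole counties and are not modeled here.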

There are also some guidelines that are followed whenever possible. The most important is that there should be as many PUMAs as possible. This implies that if there are more than 200,000 people in an area, then that area should be broken into at least two PUMAs. Also, PUMAs are intended to encompass populations that are as homogeneous as possible. Therefore, their boundaries are drawn to avoid unnecessarily breaking apart core based statistical areas (see below), Native American reservations, and urban areas. Because PUMAs are supposed to be as homogeneous as possible, our experience has been that data aggregated at the PUMA level is often the most predictive, even though it is not the most granular.

Core based statistical areas, or CBSAs, are the only type of geographic area that does not cover the entire United States; a small share of the population (mostly in very rural areas) does not reside in any CBSA. If you send an address to FastOpenData that does not have a CBSA, you will receive null values for each of the data points that are aggregated to the CBSA level.

CBSAs are built from one or more counties; and since counties are composed of census tracts, CBSAs are also composed of census tracts. They are defined by the Office of Management and Budget, and are designed to encompass areas that have a high degree of economic cohesion. For example, patterns of commuting to work from home are taken into account when defining CBSAs.

Counties are also commonly used as building blocks of congressional districts, although a populous county may be split among several districts. There are 435 congressional districts in the United States; they are contiguous, and within any particular state, the congressional districts are supposed to have as close to the same population as possible. They are defined by each state according to state laws and court rulings.

Like congressional districts, school districts are determined by state and local laws and regulations. Their boundaries change frequently, and in many places, school districts operate like local governments; they may have the power to determine their own boundaries. Unlike congressional districts and other geographic areas, school districts’ boundaries do not necessarily have any relationship to any other boundaries. School districts can cross county lines, census tracts, and so on arbitrarily.

Finally, we get to zip codes. You won’t see any zip codes in FastOpenData responses. This may be surprising, because it is so common to use zip codes to aggregate data on households. Indeed, zip codes have one major advantage when it comes to data: they can simply be read off an address. However, there are multiple problems caused by using zip codes as geographic areas for the purpose of data collection and analysis.

First, zip codes are determined by the postal service, which has the authority to change their boundaries at any time. Second, unlike other geographic areas that are used by FastOpenData, zip codes are not intended to take in demographically similar households. Thus, it is more problematic to attribute characteristics to any particular person or household based on statistics about the zip code.

Third, there are actually two different types of zip code, and they do not necessarily match. Specifically, there are the zip codes that are defined by the postal service and which appear on addresses; but there is also the so-called zip code tabulation area (ZCTA) which is defined by the Census Bureau. ZCTAs look like ordinary zip codes, and they usually coincide with them, but they can differ arbitrarily if the postal service decides to update a zip code’s boundaries. The purpose of ZCTAs is to provide a more stable counterpart to the zip code for purposes of data collection.

All of this is why it is especially problematic to rely on zip codes that have been gleaned from addresses. It turns out that if you’re using data about zip codes, you’re probably actually using data that was compiled for ZCTAs, but attributing that data to the postal service’s zip codes — and those two entities can diverge. FastOpenData does not ever use the postal service’s zip codes, except for geolocating addresses.