Data Sources…

17 Mar 2023 Article on Overview

Photo by Susan Q Yin on Unsplash

When I first started working with publicly accessible datasets, I would typically save them to an external hard drive. This lasted about six months before I realized that I had over 200GB of data. That may sound fine, but my documentation of where the datasets originated was poor, and information about how the individual features were encoded was often unclear. Without this information, I would have been just as well off creating synthetic data of my own.

I also realized that the usefulness or validity of the data was often suspect. So in this article, I attempt to describe the factors that determine whether a dataset is good or not. At the bottom of this article is my second attempt at aggregating my resources.

Thinking Like a Data Scientist

Below is a collection of data sources that might be helpful as you work on projects. Please note that you’re free to use datasets from other sources as well. I would advise that you start looking at datasets to become familiar with what is available and the possible limitations. Consider these questions:

What type of questions could I ask?
Can the data be broken into data subsets so I can discover a hidden story?
What type of graphs would tell the story well?
Most importantly, is the data quality good enough?

Data Quality

Data Breadth or Variety - the different types of measurements: categorical, numerical, geographic, time (columns)
Data Quantity or Volume - the amount of data, including repeated measurements (rows)
Data Relevancy or Velocity - the frequency or rate at which the data changes or becomes outdated/irrelevant
Missing, Outlying, and Faulty Data or Veracity - the values that seem to be unavailable, unreasonable, or impossible
Story-Telling and Truth-Telling or Value - What type of story could come out of the investigation? What uses and applications could this have? What research questions could I pose and answer? What relationships are present? What predictions could be made? What is the significance?
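
A few of the checks above can be roughed out in code before committing to a dataset. The sketch below is only one possible approach, using the Python standard library: it counts rows and columns (volume and variety), counts missing values, and flags values outside a plausible range (veracity). The sample data, the column names, the list of "missing" markers, and the plausible range are all assumptions made up for illustration.

```python
import csv
import io

# Assumed spellings of "missing"; real datasets may use others (e.g. -999).
MISSING_MARKERS = {"", "NA", "N/A", "null"}


def profile_rows(rows, plausible_ranges):
    """Summarize rows (a list of dicts): size, missing values, out-of-range values."""
    columns = list(rows[0].keys()) if rows else []
    report = {"n_rows": len(rows), "n_columns": len(columns), "columns": {}}
    for col in columns:
        values = [row[col] for row in rows]
        present = [
            v for v in values
            if v is not None and v.strip() not in MISSING_MARKERS
        ]
        info = {"missing": len(values) - len(present), "out_of_range": []}
        if col in plausible_ranges:
            low, high = plausible_ranges[col]
            for v in present:
                try:
                    number = float(v)
                except ValueError:
                    # A non-numeric entry in a numeric column is also suspect.
                    info["out_of_range"].append(v)
                    continue
                if not low <= number <= high:
                    info["out_of_range"].append(v)
        report["columns"][col] = info
    return report


# Made-up sample: one missing temperature and one impossible one.
SAMPLE_CSV = """city,temp_f,reading_date
Chicago,71,2023-03-01
Boston,68,2023-03-01
New York,,2023-03-02
Los Angeles,75,2023-03-02
Chicago,70,2023-03-03
Boston,690,2023-03-03
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
# Assumed plausible range for an outdoor temperature in Fahrenheit.
report = profile_rows(rows, {"temp_f": (-80.0, 135.0)})
print(report)
```

A report like this does not answer the value question, but it makes the missing and impossible entries visible before any graphs are drawn.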

Data Aggregations

Specific DataSets

Federal Data Portals

If federal data is collected by the states and then combined at the federal level, in many cases it will not have been collected in the same way. Each state often develops its own definitions and procedures, which can make comparing state data confusing.
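
One way to keep state-by-state differences visible is to map each state's field names onto a common schema and record whatever does not fit. This is a minimal sketch under stated assumptions: the state labels, field names, and sample record are hypothetical, and renaming fields does nothing to reconcile different definitions, which still has to be checked against each state's documentation.

```python
# Hypothetical field names; real portals document their own.
STATE_FIELD_MAPS = {
    "state_a": {"county_name": "county", "case_count": "cases"},
    "state_b": {"County": "county", "confirmed": "cases"},
}


def harmonize(state, record):
    """Rename a state's fields to the common schema and list what was left over."""
    mapping = STATE_FIELD_MAPS[state]
    unified = {"state": state}
    for source_name, common_name in mapping.items():
        # A field the state did not report becomes None rather than an error.
        unified[common_name] = record.get(source_name)
    # Fields with no common equivalent are kept by name for later review.
    unified["unmapped_fields"] = sorted(set(record) - set(mapping))
    return unified


# Made-up records for illustration only.
record_a = {"county_name": "Example County", "case_count": 12, "report_week": "2023-10"}
record_b = {"County": "Sample County", "confirmed": 7}

unified_a = harmonize("state_a", record_a)
unified_b = harmonize("state_b", record_b)
print(unified_a)
print(unified_b)
```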

United States

CDC

NOAA

NASA

State Data Portals

Minnesota Health Data

City Data Portals

City data portals can be interesting, but they can also be confusing because you need to investigate what each variable means, how it was collected, and whether it was modified from its original version. For example, many cities collect and publish rideshare data from taxi companies and app services, but the location and price data are often modified to protect privacy. These datasets are also very large and receive very little preprocessing to correct errors. You may find a rideshare trip that lasts 20 hours and goes halfway across the country.
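
Trips like that 20-hour ride can be flagged with a simple plausibility check before any analysis. The sketch below is an illustration only: the field names (`trip_id`, `start`, `end`, `miles`), the sample trips, and the two thresholds are assumptions, not the schema or rules of any particular city portal.

```python
from datetime import datetime

MAX_HOURS = 6.0    # assumed ceiling for a plausible city trip
MAX_MILES = 100.0  # assumed ceiling for a plausible city trip


def trip_problems(trip):
    """Return a list of reasons a trip looks implausible (empty if it looks fine)."""
    problems = []
    start = datetime.fromisoformat(trip["start"])
    end = datetime.fromisoformat(trip["end"])
    hours = (end - start).total_seconds() / 3600
    if hours <= 0:
        problems.append("non-positive duration")
    elif hours > MAX_HOURS:
        problems.append("duration over limit")
    if trip["miles"] > MAX_MILES:
        problems.append("distance over limit")
    return problems


# Made-up trips: one ordinary, one 20-hour cross-country ride, one zero-length.
trips = [
    {"trip_id": "a1", "start": "2023-03-01T08:00:00", "end": "2023-03-01T08:25:00", "miles": 4.2},
    {"trip_id": "b2", "start": "2023-03-01T09:00:00", "end": "2023-03-02T05:00:00", "miles": 1150.0},
    {"trip_id": "c3", "start": "2023-03-01T10:00:00", "end": "2023-03-01T10:00:00", "miles": 0.0},
]

flagged = {}
for trip in trips:
    problems = trip_problems(trip)
    if problems:
        flagged[trip["trip_id"]] = problems
print(flagged)
```

Flagging rather than deleting keeps the decision in your hands: a suspicious trip may be a data-entry error, or it may be the hidden story.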

Chicago

Boston

New York

Los Angeles

For an Extensive List

University Research

University research data can be interesting, but it is often a very small, specific sample that is not representative of the broader population. This is usually due to the time, cost, and difficulty of obtaining data, which is why graduate students may take several years to collect all the data they need for a PhD thesis or dissertation.

Common API Sources

APIs can be great, and they can be unpredictable. One benefit is that you can typically get very recent information, or even information that refreshes almost instantaneously. The downside is that the owner may provide very limited free access, or the data could be very messy. I have seen only a couple of these used in a project. Here are some options:

OpenWeatherMap (class activity)

Yahoo Finance

OMDB (class activity)

IMDB

Quandl (later in the class activity)

Spotify

NY Times (class activity)

Yelp

Mapquest

TVMaze (class activity)

WorldBank (class activity)

Google API (class activity)

Census (class activity)

NASA API

Numbers
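
Whichever API you pick, the workflow is usually the same: build a request URL with your parameters, fetch the JSON, and discard messy records. The sketch below shows that pattern with the standard library only. The endpoint, the parameter names, and the payload shape (a `results` list with `station` and `value` keys) are hypothetical and do not describe any of the services listed above; check each provider's documentation for the real ones. The cleaning step runs on an offline sample so the example works without a key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.example.com/v1/observations"  # hypothetical endpoint


def build_url(base_url, **params):
    """Append query parameters in sorted order so the URL is reproducible."""
    return f"{base_url}?{urlencode(sorted(params.items()))}"


def fetch_json(url, timeout=10):
    """Download and decode a JSON response (not called in this offline example)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def clean_records(payload, required=("station", "value")):
    """Keep only records where every required field is present and not null."""
    kept = []
    for record in payload.get("results", []):
        if all(record.get(key) is not None for key in required):
            kept.append(record)
    return kept


url = build_url(BASE_URL, city="Chicago", api_key="YOUR_KEY")

# Made-up payload standing in for a messy response.
sample_payload = json.loads(
    '{"results": [{"station": "A", "value": 3.5},'
    ' {"station": "B", "value": null},'
    ' {"value": 1.0}]}'
)
usable = clean_records(sample_payload)
print(url)
print(usable)
```

With a real key and endpoint, `fetch_json(url)` would replace the sample payload. Free tiers often limit the number of calls, so saving each response to disk instead of requesting it again is usually worth the extra line.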

Machine Learning: Computer Vision

Examples practical only for the final project

Chest X-Ray Pneumonia Images (Machine Learning)

Handwriting Samples (Machine Learning)

Captcha Images (Machine Learning)


ERIN WILLS