Data Sources…

Photo by Susan Q Yin on Unsplash

When I first started working with publicly accessible datasets, I would typically save them to an external hard drive. This lasted about six months before I realized that I had over 200GB of data. That may sound fine, but my documentation of where the datasets originated was poor, and information about how the individual features were encoded was often unclear. Without this information, I would have been just as well off creating synthetic data of my own.

I also realized that the usefulness or validity of the data was often suspect. So in this article, I attempt to describe the factors that determine whether a dataset is good or not. At the bottom of this article is my second attempt at aggregating my resources.


Thinking Like a Data Scientist

Below is a collection of data sources that might be helpful as you work on projects. Please note that you’re free to use datasets from other sources as well. I would advise that you start looking at datasets to become familiar with what is available and the possible limitations. Consider these questions:

  • What type of questions could I ask?
  • Can the data be broken into data subsets so I can discover a hidden story?
  • What type of graphs would tell the story well?
  • Most importantly, is the data quality good enough?

Data Quality

  • Data Breadth or Variety - different types of measurements - categorical, numerical, geographic, time (columns)
  • Data Quantity or Volume - the amount of data, including repeated measurements (rows)
  • Data Relevancy or Velocity - the frequency or rate at which the data changes or becomes outdated/irrelevant
  • Missing, Outlying, and Faulty Data or Veracity - the values that seem to be unavailable, unreasonable, or impossible
  • Story-Telling and Truth-Telling or Value - What type of story could come out of the investigation? What uses and applications could this have? What research questions could I pose and answer? What relationships are present? What predictions could be made? What is the significance?
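A quick first pass at the breadth, volume, and veracity items in this list can be automated. The following is only a minimal sketch using the Python standard library; the `profile_rows` helper, the sample CSV, and the list of placeholder strings treated as "missing" are all assumptions made for illustration, not part of any particular dataset.

```python
import csv
import io
from collections import Counter

# Placeholder strings treated as missing values (an assumption; adjust per dataset).
MISSING_MARKERS = {"", "NA", "N/A", "null"}

def profile_rows(rows):
    """Summarize breadth (columns), volume (rows), and missing values."""
    columns = list(rows[0].keys()) if rows else []
    missing = Counter()
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value.strip() in MISSING_MARKERS:
                missing[col] += 1
    return {
        "n_columns": len(columns),  # breadth
        "n_rows": len(rows),        # volume
        "missing_by_column": {col: missing[col] for col in columns},  # veracity
    }

# A tiny made-up sample, standing in for a downloaded file.
sample_csv = """city,temp_f,recorded_at
Austin,98,2021-07-01
Boston,,2021-07-01
Denver,NA,2021-07-02
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
report = profile_rows(rows)
print(report)
```

A report like this will not tell you whether the data has a story worth telling, but it does make it easier to decide early whether a dataset deserves more of your time.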


Data Aggregations


Specific DataSets


Federal Data Portals

If federal data is collected by the states and combined at the federal level, then in many cases the data will not have been collected in the same way. Each state often develops its own definitions and procedures, which can make comparing state data confusing.

State Data Portals

City Data Portals

City data portals can be interesting, but they can also be confusing because you need to investigate what each variable means, how it was collected, and whether it was modified from its original version. For example, many cities collect and publish rideshare data from taxi companies and app services, but the location and price data are often modified to protect privacy. These datasets also tend to be very large and receive very little preprocessing to correct errors. You may find a rideshare trip that lasts 20 hours and goes halfway across the country.
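One way to catch trips like that is a simple plausibility check before any analysis. The sketch below is only an illustration: the field names (`pickup_time`, `dropoff_time`, `trip_miles`), the two thresholds, and the sample records are assumptions, so check the data dictionary of the portal you are actually using.

```python
from datetime import datetime

# Thresholds are assumptions chosen for illustration, not official limits.
MAX_HOURS = 6.0
MAX_MILES = 150.0

def flag_implausible(trip):
    """Return a list of reasons a trip looks suspicious (empty if none)."""
    start = datetime.fromisoformat(trip["pickup_time"])
    end = datetime.fromisoformat(trip["dropoff_time"])
    hours = (end - start).total_seconds() / 3600

    reasons = []
    if hours <= 0:
        reasons.append("non-positive duration")
    if hours > MAX_HOURS:
        reasons.append("duration too long")
    if trip["trip_miles"] > MAX_MILES:
        reasons.append("distance too long")
    return reasons

# Made-up records: one ordinary trip and one 20-hour, cross-country trip.
trips = [
    {"trip_id": "a", "pickup_time": "2021-03-01T08:00:00",
     "dropoff_time": "2021-03-01T08:25:00", "trip_miles": 4.2},
    {"trip_id": "b", "pickup_time": "2021-03-01T09:00:00",
     "dropoff_time": "2021-03-02T05:00:00", "trip_miles": 1200.0},
]

flagged = {trip["trip_id"]: flag_implausible(trip) for trip in trips}
print(flagged)
```

Flagging rather than deleting keeps the decision visible: you can review the suspicious rows and document why you kept or dropped them.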

University Research

This can be interesting data, but it is often a very small, specific sample that is not representative of the broader population. This is often due to the time, cost, and difficulty of obtaining data, which is why graduate students may take several years to collect all the data that they need for their PhD thesis/dissertation.


Common API Sources

APIs can be great, and they can be unpredictable. One benefit is that you can typically get very recent information, or even information that refreshes almost instantaneously. The downside is that the owner may provide very limited free access, or the data could be very messy. I have seen only a couple of these used in a project. Here are some options:

  • OpenWeatherMap (class activity)
  • Yahoo Finance
  • OMDB (class activity)
  • IMDB
  • Quandl (later in the class activity)
  • Spotify
  • NY Times (class activity)
  • Yelp
  • Mapquest
  • TVMaze (class activity)
  • WorldBank (class activity)
  • Google API (class activity)
  • Census (class activity)
  • NASA API
  • Numbers
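Most of the services listed above follow the same basic pattern: build a URL with query parameters, request it, and parse the JSON that comes back. Here is a minimal sketch of that pattern using only the Python standard library. The base URL and parameter names in the usage comment are hypothetical placeholders, not any real provider's endpoint, and the pause between requests is an assumed courtesy for limited free tiers; consult each provider's documentation for its actual endpoints, keys, and rate limits.

```python
import json
import time
import urllib.parse
import urllib.request

def build_url(base_url, params):
    """Attach URL-encoded query parameters to a base URL."""
    return f"{base_url}?{urllib.parse.urlencode(params)}"

def fetch_json(url, timeout=10):
    """Request a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

def fetch_all(base_url, param_sets, pause_seconds=1.0, fetch=fetch_json):
    """Fetch one response per parameter set, pausing between requests."""
    results = []
    for params in param_sets:
        results.append(fetch(build_url(base_url, params)))
        time.sleep(pause_seconds)  # stay polite on limited free tiers
    return results

# Hypothetical usage (placeholder endpoint and key; nothing is requested here):
# data = fetch_all(
#     "https://api.example.com/v1/data",
#     [{"q": "Austin", "api_key": "YOUR_KEY"}, {"q": "Boston", "api_key": "YOUR_KEY"}],
# )
```

Passing the `fetch` function as a parameter is a small design choice that lets you swap in a fake during testing, so you do not spend your free quota while debugging.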


Machine Learning: Computer Vision

These examples are practical only for the final project.



This site is a modified version of Hydejack v9.1.4 created by Erin Wills.