Data Sources…

When I first started working with publicly accessible datasets, I would typically save them to an external hard drive. This lasted about six months before I realized that I had over 200GB of data. That may sound fine, but I had poorly documented where the datasets originated, and the encoding of individual features was often unclear. Without this information, I would have been just as well off creating synthetic data of my own.
I also realized that the usefulness or validity of the data was often suspect. So in this article, I attempt to describe the factors that determine whether a dataset is good or not. At the bottom of this article is my second attempt at aggregating my resources.
Thinking Like a Data Scientist
Below is a collection of data sources that might be helpful as you work on projects. Please note that you’re free to use datasets from other sources as well. I would advise that you start looking at datasets to become familiar with what is available and the possible limitations. Consider these questions:
- What type of questions could I ask?
- Can the data be broken into data subsets so I can discover a hidden story?
- What type of graphs would tell the story well?
Most importantly, is the data quality good enough?
Data Quality
Data Breadth or Variety
- different types of measurements: categorical, numerical, geographic, time (columns)
Data Quantity or Volume
- the amount of data, including repeated measurements (rows)
Data Relevancy or Velocity
- the frequency or rate at which the data changes or becomes outdated or irrelevant
Missing, Outlying, and Faulty Data or Veracity
- values that seem to be unavailable, unreasonable, or impossible
Story-Telling and Truth-Telling or Value
- What type of story could come out of the investigation? What uses and applications could this have? What research questions could I pose and answer? What relationships are present? What predictions could be made? What is the significance?
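Several of the checks above can be roughed out in a few lines before committing to a dataset. The following is a minimal sketch using only the Python standard library. The sample table, the column names, and the helper names `profile` and `iqr_outliers` are made up for illustration, and the 1.5 × IQR cutoff is just one common rule of thumb for flagging suspect values, not a definitive test.

```python
import csv
import io
import statistics

def profile(rows):
    """Summarize volume (rows), breadth (columns), and missing values per column."""
    columns = list(rows[0].keys()) if rows else []
    missing = {c: sum(1 for r in rows if r[c] in ("", None)) for c in columns}
    return {"n_rows": len(rows), "n_columns": len(columns), "missing": missing}

def iqr_outliers(values):
    """Return values more than 1.5 * IQR outside the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

# Hypothetical sample data: one missing humidity reading, one impossible temperature.
SAMPLE_CSV = """city,temp_c,humidity
Austin,31.0,40
Boston,22.5,
Chicago,24.0,55
Denver,26.5,30
El Paso,250.0,20
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = profile(rows)
temps = [float(r["temp_c"]) for r in rows if r["temp_c"] != ""]
suspect = iqr_outliers(temps)

print(summary)
print("Suspect temperatures:", suspect)
```

A flagged value is only a prompt to go back to the source documentation. Whether it is a typo, a unit mix-up, or a real extreme is a judgment the code cannot make.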
Data Aggregations
- Public APIs
- Public Datasets
- DataHub
- Google Dataset Search
- OpenML
- WHO
- FiveThirtyEight
- Amazon Open Data
- Gapminder
- Data World
- Data.gov
- Industry Data (UK)
- Earth Data
- UCI Machine Learning Repository
- CERN OpenData
- Kaggle Examples
Specific Datasets
- Billboard Top 100 Weekly 1959+
- Boston Housing Data
- Netflix Shows
- Fashion Images (Machine Learning)
- Spotify Song Characteristics
- Spotify Top Songs Decade
- World Happiness Report
Federal Data Portals
If federal data is collected by the states and then combined at the federal level, in many cases it will not have been collected in the same way. Each state often develops its own definitions and procedures, which can make comparing state data confusing.
State Data Portals
City Data Portals
City data portals can be interesting, but they can also be confusing because you need to investigate what each variable means, how it was collected, and whether it was modified from its original version. For example, many cities collect and publish rideshare data from taxi companies and app services, but the location and price data are often modified to protect privacy. These datasets are also very large and have had very little preprocessing to correct errors. You may find a rideshare trip that lasts 20 hours and goes halfway across the country.
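A simple sanity filter can catch trips like that before they distort an analysis. This is only a sketch: the `Trip` fields and the `MAX_HOURS` and `MAX_MILES` thresholds are assumptions chosen for illustration, and a real portal will have its own column names and its own definition of a reasonable trip.

```python
from dataclasses import dataclass

@dataclass
class Trip:
    trip_id: str
    duration_hours: float
    distance_miles: float

# Assumed limits for a single city trip; adjust to the dataset and the city.
MAX_HOURS = 6.0
MAX_MILES = 200.0

def is_plausible(trip):
    """Keep trips with positive duration and distance that stay under the assumed limits."""
    return (0 < trip.duration_hours <= MAX_HOURS
            and 0 < trip.distance_miles <= MAX_MILES)

# Hypothetical records: a normal trip, a cross-country "trip", and an empty record.
trips = [
    Trip("a", 0.4, 3.2),
    Trip("b", 20.0, 1400.0),
    Trip("c", 0.0, 0.0),
]

kept = [t for t in trips if is_plausible(t)]
flagged = [t for t in trips if not is_plausible(t)]

print("Kept:", [t.trip_id for t in kept])
print("Flagged for review:", [t.trip_id for t in flagged])
```

Setting flagged rows aside for review, rather than deleting them, makes it easier to tell how much of the dataset the thresholds removed.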
University Research
This can be interesting data, but it is often a very small, specific sample that is not representative of the broader population. This is often due to the time, cost, and difficulty of obtaining data, which is why graduate students may take several years to collect all the data they need for a PhD thesis or dissertation.
Common API Sources
APIs can be great, and they can be unpredictable. One benefit is that you can typically get very recent information, or even information that refreshes almost instantaneously. The downside is that the owner may provide very limited free access, or the data could be very messy. I have seen only a couple of these used in a project. Here are some options:
- OpenWeatherMap (class activity)
- Yahoo Finance
- OMDB (class activity)
- IMDB
- Quandl (later in the class activity)
- Spotify
- NY Times (class activity)
- Yelp
- Mapquest
- TVMaze (class activity)
- WorldBank (class activity)
- Google API (class activity)
- Census (class activity)
- NASA API
- Numbers
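Most of these APIs follow the same pattern: build a URL with query parameters and an API key, request it, and parse the JSON that comes back. Here is a minimal sketch using OpenWeatherMap as the example. The endpoint and the `q`, `appid`, and `units` parameters reflect my understanding of its current-weather API, so check the provider's documentation before relying on them. `YOUR_API_KEY` is a placeholder, and `SAMPLE_RESPONSE` is a hand-written, trimmed stand-in for a real response, not actual data.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for current weather; confirm against the provider's docs.
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"

def build_url(city, api_key, units="metric"):
    """Build the request URL with properly encoded query parameters."""
    return BASE_URL + "?" + urlencode({"q": city, "appid": api_key, "units": units})

def parse_temperature(payload):
    """Pull the city name and temperature out of a JSON response body."""
    data = json.loads(payload)
    return data["name"], data["main"]["temp"]

# Hand-written stand-in for a response body, so the sketch runs without a key.
SAMPLE_RESPONSE = '{"name": "London", "main": {"temp": 15.2, "humidity": 72}}'

url = build_url("London", "YOUR_API_KEY")
city, temp = parse_temperature(SAMPLE_RESPONSE)

print(url)
print(city, temp)

# With a real key, the body could be fetched with the standard library:
#   from urllib.request import urlopen
#   with urlopen(url) as response:
#       payload = response.read().decode("utf-8")
```

Keeping URL building and parsing separate from the network call makes it easy to test the parsing on a saved response, which helps when free access is rate-limited.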
Machine Learning: Computer Vision
Examples practical only for the final project