Data Sources
Here is a list of places where you can get free data, typically tabular data in the form of CSV (comma-separated values), TSV (tab-separated values), or Excel files.
All science degrees at the university will require basic fluency in data acquisition and chart generation. Excel was the popular tool of the past, but Jupyter Notebooks are the current tool of choice. A basic understanding of statistics and of how to test a hypothesis is paramount.
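As a rough illustration of working with the tabular formats mentioned above, here is a minimal Python sketch that parses CSV and TSV text and computes a few summary statistics using only the standard library. The sample data, the column names, and the helper functions `read_table` and `column_summary` are made up for this example; they do not come from any of the sources listed on this page.

```python
import csv
import io
import statistics

# Made-up sample data standing in for a downloaded file (hypothetical values).
CSV_TEXT = "city,population\nAlpha,1200\nBeta,3400\nGamma,560\n"
# The same table with tabs instead of commas, i.e. TSV.
TSV_TEXT = CSV_TEXT.replace(",", "\t")


def read_table(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def column_summary(rows, column):
    """Return count, mean, and median for a numeric column."""
    values = [float(row[column]) for row in rows]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


csv_rows = read_table(CSV_TEXT)
tsv_rows = read_table(TSV_TEXT, delimiter="\t")
summary = column_summary(csv_rows, "population")
print(summary)
```

To read a real file, open it with `open(path, newline="")` and pass the file object to `csv.DictReader` in place of the `io.StringIO` wrapper. Excel files need a third-party library; `pandas.read_excel` is one common option, and pandas also works well inside a Jupyter Notebook for charting.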
Links
- The Home of the U.S. Government's Open Data - 250K searchable data sets publicly available.
- Google Data Search Engine - Unifies tens of thousands of different dataset repositories and makes that data discoverable for everyone.
- World Bank Open Data - Free and open access to global development data. The World Bank funds initiatives in underdeveloped nations on a regular basis, then collects statistics to track their success.
- Awesome Public Datasets - A list of high-quality, topic-centric public data sources, collected and tidied from blogs, answers, and user responses. Most of the data sets on the list are free, but some are not.
- Making over 127.15TB of research data available - A website dedicated to the distribution of data sets from scholarly studies. It contains a plethora of intriguing data sets.
- UC Irvine Machine Learning Repository - The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by UCI PhD student David Aha. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning datasets.
- THE MNIST DATABASE of handwritten digits - The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.
- The Chars74K dataset - Character Recognition in Natural Images - Character recognition is a classic pattern recognition problem for which researchers have worked since the early days of computer vision. With today's omnipresence of cameras, the applications of automatic character recognition are broader than ever. For Latin script, this is largely considered a solved problem in constrained situations, such as images of scanned documents containing common character fonts and uniform background. However, images obtained with popular cameras and hand held devices still pose a formidable challenge for character recognition. The challenging aspects of this problem are evident in this dataset.
- Face Databases
- KDnuggets - Datasets for Data Science, Machine Learning, AI & Analytics
- A Comprehensive List of Open Data Portals from Around the World - DataPortals.org is the most comprehensive list of open data portals in the world. It is curated by a group of leading open data experts from around the world - including representatives from local, regional and national governments, international organisations such as the World Bank, and numerous NGOs.
- Sage Data - Billions of multidisciplinary, global statistics for research and instruction
- 25 Open Datasets for Deep Learning Every Data Scientist Must Work With