Datasets for Data Mining
Posted by Fateh BEKIOUA on January 11th, 2012
For those looking for datasets to practice applying data mining techniques, here is a list taken from the well-known portal KDnuggets.com.
Data repositories
- KDD Cup center, with all data, tasks, and results.
- UCI KDD Database Repository for large datasets used in machine learning and knowledge discovery research.
- UCI Machine Learning Repository.
- AWS (Amazon Web Services) Public Data Sets, provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
- Bioassay data, described in Virtual screening of bioassay data, by Amanda Schierz, J. of Cheminformatics, with 21 Bioassay datasets (Active / Inactive compounds) available for download.
- Canada Open Data, pilot project with many government and geospatial datasets.
- Causality Workbench data repository.
- Data Source Handbook, A Guide to Public Data, by Pete Warden, O’Reilly (Jan 2011).
- Data.gov.uk, publicly available data from UK (also London datastore.)
- DataMarket, visualize the world’s economy, societies, nature, and industries, with 100 million time series from UN, World Bank, Eurostat and other important data providers.
- Datamob, public data put to good use.
- DataSF.org, a clearinghouse of datasets available from the City & County of San Francisco, CA.
- DataFerrett, a data mining tool that accesses and manipulates TheDataWeb, a collection of many on-line US Government datasets.
- Delve, Data for Evaluating Learning in Valid Experiments
- EconData, thousands of economic time series, produced by a number of US Government agencies.
- Enron Email Dataset, data from about 150 users, mostly senior management of Enron.
- FEDSTATS, a comprehensive source of US statistics and more
- FIMI repository for frequent itemset mining, implementations and datasets.
- Financial Data Finder at OSU, a large catalog of financial data sets
- GEO (Gene Expression Omnibus), a gene expression/molecular abundance repository supporting MIAME compliant data submissions, and a curated, online resource for gene expression data browsing, query and retrieval.
- GeoDa Center, geographical and spatial data.
- Google ngrams datasets, text from millions of books scanned by Google.
- Grain Market Research, financial data including stocks, futures, etc.
- Hilary Mason research-quality Big Data sets collection - many text and image datasets.
- ICWSM-2009 dataset contains 44 million blog posts made between August 1st and October 1st, 2008.
- Infobiotics PSP (protein structure prediction) datasets, adjustable real-world family of benchmarks for testing the scalability of classification/regression methods.
- Infochimps, an open catalog and marketplace for data. You can share, sell, curate, and download data about anything and everything.
- Investor Links, includes financial data
- Kevin Chai list of datasets, for text, SNA, and other fields.
- MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research.
- ML Data, the data repository of the EU Pascal2 networks.
- NASDAQ Data Store, provides access to market data.
- National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.
- National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.
- OpenData from Socrata, access to over 10,000 datasets including business, education, government, and fun.
- Peter Skomoroch dataset Bookmarks
- PubGene(TM) Gene Database and Tools, genomic-related publications database
- Robert Shiller data on housing, stock market, and more from his book Irrational Exuberance.
- SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments.
- SourceForge.net Research Data, includes historic and status statistics on approximately 100,000 projects and over 1 million registered users’ activities at the project management web site.
- StatLib, CMU Datasets Archive.
- STATOO Datasets part 1 and STATOO Datasets part 2
- Time Series Data Library
- UCR Time Series Data Archive, offering datasets, papers, links, and code.
- United States Census Bureau.
- Wikipedia User Contribution Dataset, prepared for an ongoing study on user reputation and content quality in Wikipedia at UCI.
- Wikiposit, a (virtual) amalgamation of (mostly financial) data from many different sites, allowing users to merge data from different sources
- Yahoo Sandbox datasets, Language, Graph, Ratings, Advertising and Marketing, Competition
University of Edinburgh
http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html
