For all the machine-learning fans out there, here is a short list of datasets Google has released over the years.
- Word co-occurrence counts for training n-gram language models (translation, spelling correction, speech recognition):
- Job queue traces from Google clusters:
- 800M documents (search corpus) annotated with Freebase entities: blog post
- Wikilinks, 40M disambiguated mentions in 10M web pages linked to Wikipedia entities: blog post
- Human-judged corpus of binary relations about Wikipedia public figures (pairings of people with Freebase concepts, each annotated with a supporting document and a human rater's confidence): blog post, data
- Wikipedia Infobox edit history (39M updates to attributes of 1.8M entities): blog post
- Triples of (phrase, URL of a Wikipedia entity, number of times the phrase appears in the page at that URL), useful for building entity-name dictionaries: blog post
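To give a sense of how word n-gram counts like the ones in the first dataset get used, here is a minimal sketch of a bigram language model built from raw counts. The tab-separated "w1 w2<TAB>count" format and the toy counts below are assumptions for illustration; the actual dataset's format may differ, and a real system would add smoothing for unseen pairs.

```python
from collections import defaultdict

# Hypothetical bigram counts in a tab-separated "w1 w2<TAB>count" style;
# the real corpus format may differ.
raw = """the cat\t20
the dog\t30
cat sat\t10
dog sat\t5"""

bigram = defaultdict(int)   # count of (w1, w2) pairs
unigram = defaultdict(int)  # count of w1 as a left context
for line in raw.splitlines():
    ngram, count = line.split("\t")
    w1, w2 = ngram.split()
    bigram[(w1, w2)] += int(count)
    unigram[w1] += int(count)

def p_next(w1, w2):
    """Conditional probability P(w2 | w1) from raw counts (no smoothing)."""
    return bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0

print(p_next("the", "dog"))  # 0.6
print(p_next("the", "cat"))  # 0.4
```

A spelling corrector or speech recognizer would use such probabilities to rank candidate word sequences, preferring the ones the count data makes more likely.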
For other data sources, see the related discussion on Hacker News.