Find Data for AI

How to find Good Data for AI projects

Dear AI friends,

all you need is … DATA! And love, of course 🙂

Yes, I’m not joking. The fuel for any AI System is Data. No matter how clever the technologies are, they depend on data. More importantly, they depend on “good” data. If you have good Data, you have already solved 50% of your problem. Any AI System is “data-hungry” and can only be as smart as the information you provide it with.

So, before we start with clever ML algorithms, let’s make sure we know how to find Data for your AI 😉

Building a gold-standard corpus is seriously hard work! Good resources cost a lot of money. One of the secrets of why ML is so popular today is that it can work not only with structured but also with unstructured data. “Luxury structured” or “unstructured” – we still need Data, and in this article I want to show you some cool Datasets which you can use for free 😀

Plan

  1. Web as a corpus – create your own corpora
  2. Prepared corpora
    2.1. Accessing Corpora through NLTK
    2.2. General Corpora
    2.3. Dialogue, chat, email Datasets
    2.4. Sentiment Analysis
    2.5. Spam / not Spam
    2.6. Summarization
    2.7. Question-Answering Datasets
    2.8. Named Entity Recognition and Disambiguation
    2.9. Pre-processed Datasets for Text Classification
  3. Other lists of free/public datasets

1. Web as a corpus – create your own corpora

One can ask: what’s the problem with Data? We do have the Web. It’s the biggest available corpus, providing access to quadrillions of words! Yes, of course.

But 🙂

Not everything on the Web is the kind of language you will want to learn or emulate. You’ll get data with mistakes from native and non-native speakers, different genres and purposes, different results on different days and a lot of gaps.

And what datasets do we need? NLP works with structured or unstructured, electronically stored datasets, which are called corpora (sg. corpus). The four characteristics of a modern corpus are: “sampling and representativeness, finite (and usually fixed) size, machine-readable, a standard reference”. You see, the Web is not always a corpus.

But sometimes we need data with many innovations, computing-related terms and slang. Moreover, in Statistical NLP one commonly receives a certain amount of data from a certain domain of interest, without any information about how it was constructed. In such cases, having more training data matters more than any concerns about balance. More data – better results. In this case the Web is a good idea, and you are welcome to use the following tools to build your own Web corpus:

KwiCFinder (Key Word in Context Finder) – a Web search concordancer that displays the search words in their textual contexts. Works with Yahoo!

WebCorp – Concordances the Web.

GlossaNet – Retrieves words or sequences of words from a pre-selected pool of daily newspapers (French, English, Spanish, Italian, Portuguese). If any match occurs, a concordance is sent to the user by email: a list of the retrieved occurrences presented in their context (by default, 40 characters to the right and 40 characters to the left) in text or HTML format. You can set up GlossaNet so that concordances are sent to you on a weekly basis.

HighBeam Library – Search an archive of more than 35 million documents from over 3,000 sources – a vast collection of articles from leading publications, updated daily & going back as far as 20 years. Can restrict to: (1) Documents (from Newspapers, Magazines, Journals, Transcripts & Books), (2) Images & Maps, (3) Reference books (Encyclopedias, Dictionaries & Almanacs).

Twitter HOWTO – use NLTK to collect and process live Twitter Data.
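If you prefer to script it yourself, here is a minimal sketch of one way to pull the raw text of a few web pages into a tiny corpus of your own. It assumes the requests and beautifulsoup4 packages are installed, and the URLs are just placeholders for whatever sources you choose:

# Minimal sketch: build a tiny web corpus from a few pages.
# Assumes: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

urls = [
    "https://en.wikipedia.org/wiki/Corpus_linguistics",            # placeholder pages –
    "https://en.wikipedia.org/wiki/Natural_language_processing",   # replace with your own sources
]

documents = []
for url in urls:
    html = requests.get(url, timeout=10).text
    # strip the markup and keep only the visible text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    documents.append(text)

# save each page as one plain-text file of the corpus
for i, doc in enumerate(documents):
    with open(f"webcorpus_{i}.txt", "w", encoding="utf-8") as f:
        f.write(doc)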

2. Prepared Corpora

As creating your own text Dataset takes a lot of time and resources, we need prepared corpora.

Before searching, it’s good to understand what exactly you are searching for 🙂

Text Corpora can be:

  • primary (consisting only of text information) or annotated / labeled / pre-processed / structured (including various metadata, annotations and features); the annotated type is the most useful one for ML methods.
  • balanced (also known as sample corpora), which try to represent a particular type of language over a specific span of time, or unbalanced: monitor corpora, which grow over time and in which the relative proportions of different materials vary, and opportunistic corpora, representing nothing more and nothing less than the data it was possible to gather for a specific task (common for web data and spoken recordings);
  • mono-, bi- or multi-lingual; comparable (two or more languages collected using the same sampling methods, e.g. the same proportions of genres and domains, the same period and so on) or parallel (source texts and their translations).

Other classifications: general (standard texts), Sentiment Analysis sets, Spam/not Spam sets, Social Media, Dialogue sets, Question-Answer sets.

ML uses annotated / labeled / preprocessed / structured datasets. If you have primary data, you can do all the annotation and preprocessing steps yourself using special tools. Older NLP methods work only with structured information; ML can also work with unstructured data. A balanced corpus is better than an unbalanced one, but it’s really hard to find balanced data. This is a case where quantity is sometimes better than quality.
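To make the first distinction concrete, here is a tiny hand-made illustration (not the output of any particular tool) of the same sentence as primary, raw data and as annotated data with part-of-speech labels in the common Penn Treebank style:

# The same sentence as primary vs. annotated data (hand-made illustration).
raw_text = "Good data beats clever algorithms."

# annotated / labeled version: every token is paired with a part-of-speech tag
annotated = [
    ("Good", "JJ"), ("data", "NNS"), ("beats", "VBZ"),
    ("clever", "JJ"), ("algorithms", "NNS"), (".", "."),
]

print(raw_text)
print(annotated)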

2.1. Accessing Corpora through NLTK

Wikipedia (NLTK Tool)
NLTK Documentation

“Relax, take it easy…” – yes, it’s about NLTK. 🙂
NLTK (Natural Language Toolkit) is a platform for building Python programs to work with human language data. Here you can easily find a lot of Datasets (access to over 50 corpora and lexical resources) plus a lot of text processing libraries.
Here is the whole List of NLTK Corpora.
You can download all these corpora from the Internet, but if you want to use the NLTK libraries, it’s better to learn how to access the data through NLTK.

How to use NLTK Toolkit?

1. First of all, install NLTK

2. Once you’ve installed NLTK, start up the Python interpreter and install the data by typing the following two commands at the Python prompt, then selecting the book collection:
import nltk
nltk.download()

3. Downloading the NLTK Book Collection: browse the available packages using nltk.download(). The Collections tab in the downloader shows how the packages are grouped into sets; select the line labeled book to obtain all the data required for the examples and exercises in the NLTK book. It consists of about 30 compressed files requiring about 100 MB of disk space. The full collection of data (i.e., everything in the downloader) is nearly ten times this size and continues to expand.

4. Once the data is downloaded to your machine, you can load some of it using the Python interpreter. The first step is to type a special command at the Python prompt which tells the interpreter to load some texts for us to explore: from nltk.book import *.
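For example, after the download you can load the book texts and try a quick concordance search – a minimal sketch following the NLTK book:

import nltk
nltk.download("book")           # fetch the "book" collection once
from nltk.book import *         # loads text1 ... text9
text1.concordance("monstrous")  # show the word in its contexts in Moby Dick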

In the NLTK online book you can find further instructions for working with NLTK corpora.

2.2. General Corpora:

Web 1T corpus (2006, 1 billion words, different languages)

British National Corpus (100 Million Words, British English)
Better interface

DWDS Kernkorpus (100 Million Words, German, 20th century)

Some general corpora, available in NLTK:

Brown Corpus – 500 samples of English-language text, totalling roughly one million words, distributed across 15 genres in rough proportion to the amount published in each genre in 1961, and compiled from works published in the United States in that year.

Project Gutenberg Selections – Selections from Project Gutenberg (a volunteer effort to digitize cultural works, and the oldest digital library) by the following 12 authors: Jane Austen (3), William Blake (2), Thornton W. Burgess, Sarah Cone Bryant, Lewis Carroll, G. K. Chesterton (3), Maria Edgeworth, King James Bible, Herman Melville, John Milton, William Shakespeare (3), Walt Whitman.

MASC Tagged Corpus – Open American National Corpus. MASC is a balanced subset of 500K words of written texts and transcribed speech drawn primarily from the Open American National Corpus (OANC). The OANC is a 15 million word (and growing) corpus of American English produced since 1990, all of which is in the public domain or otherwise free of usage and redistribution restrictions.

Penn Treebank Corpus – The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 and Treebank-3 releases of PTB.

WordNet – a large lexical database of English

genesis – Bible text (in NLTK)

CoNLL2000 – part-of-speech and chunk annotated corpus (in NLTK)

CoNLL2002 – NER, part-of-speech and chunk annotated corpus (in NLTK)

Information Extraction and Entity Recognition Corpus (in NLTK)
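Here is a minimal sketch of how several of the NLTK-hosted corpora from this list can be accessed, assuming the corresponding packages are downloaded first:

import nltk
for pkg in ["brown", "gutenberg", "wordnet", "conll2000"]:
    nltk.download(pkg)                   # fetch each corpus once

from nltk.corpus import brown, gutenberg, wordnet, conll2000

print(brown.categories())                # the 15 Brown genres
print(gutenberg.fileids())               # the Project Gutenberg selections
print(wordnet.synsets("corpus"))         # WordNet synsets for a word
print(conll2000.chunked_sents()[0])      # a POS- and chunk-annotated sentence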

2.3. Dialogue, chat, email Datasets:

Let’s pay more attention to a special type of text corpora – Dialogue Datasets. They can be used to learn diverse dialogue strategies for Data-Driven Dialogue Systems.

Switchboard (2.4 Million Words, Dialogue Domain, American English) – A corpus of over 240 hours of recorded spontaneous (but topic-prompted) telephone conversations (2,438 conversations, averaging 6 minutes in length) recorded in the early 1990s.
Version 1
Version 2

Microsoft Dialogue Dataset based on booking a vacation

Ubuntu Chat Corpus – two-person multi-turn unlabelled conversations extracted from the Ubuntu chat logs, used to receive technical support for various Ubuntu-related problems.

Enron Email Dataset – 0.5M email messages.

SRI American Express travel agent dialogue corpus – A corpus of actual travel agent interactions with client callers, consisting of 21 tapes containing 2–9 calls each.
Santa Barbara corpus – an interesting one because it is a transcription of spoken dialogues.

Available in NLTK:
NPS Chat Corpus – posts from age-specific online chat rooms
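A minimal sketch of loading it and looking at the dialogue-act label attached to each post:

import nltk
nltk.download("nps_chat")                # fetch the corpus once
from nltk.corpus import nps_chat

posts = nps_chat.xml_posts()
for post in posts[:5]:
    # each post carries a dialogue-act class such as "Greet" or "Statement"
    print(post.get("class"), "->", post.text)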

Other Dialogue Dataset Collections:
https://breakend.github.io/DialogDatasets/
http://freeconnection.blogspot.de/2016/04/conversational-datasets-for-train.html

2.4. Sentiment Analysis:

IMDB Movie Reviews – 50,000 annotated IMDB movie reviews

Multi-Domain Sentiment Dataset – contains product reviews taken from Amazon.com from 4 product types (domains): Kitchen, Books, DVDs, and Electronics

Sanders Analytics Twitter Sentiment Corpus – 5,513 hand-classified tweets

Available in NLTK:
Opinion Lexicon – Curated list of positive/negative words

Movie Reviews – 2,000 sentiment-annotated movie reviews
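A minimal sketch of accessing these two NLTK resources – the Opinion Lexicon and the sentiment-annotated movie reviews:

import nltk
nltk.download("opinion_lexicon")
nltk.download("movie_reviews")
from nltk.corpus import opinion_lexicon, movie_reviews

print(list(opinion_lexicon.positive())[:5])    # a few positive words
print(list(opinion_lexicon.negative())[:5])    # a few negative words
print(movie_reviews.categories())              # ['neg', 'pos']
print(movie_reviews.words(movie_reviews.fileids("pos")[0])[:20])  # start of one positive review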

2.5. Spam / not Spam:

SMS Spam Dataset
Email Spam Dataset
Youtube comment Spam Dataset
Untroubled Spam Archive

2.6. Summarization:

Text Analysis Conference 

British Columbia Conversation Corpora

CMU Movie Summary Corpus

Multilingual Summarization

2.7. Question-Answering Datasets:

Li and Roth Question Classification dataset – Collection of questions (without answers) from the TREC conference datasets

TREC Question-Answering Collections

Biomedical Semantic Indexing and QA

A big collection of different QA Datasets

 

2.8. Datasets for Named Entity Recognition and Disambiguation


SemEval-2013 Task 11 data: https://www.cs.york.ac.uk/semeval-2013/task11/index.php?id=data.html

Annotated Corpus for Named Entity Recognition (27 MB): https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus – corpus (CoNLL 2002) annotated with IOB and POS tags. Entity types: geo = Geographical Entity, org = Organization, per = Person, gpe = Geopolitical Entity, tim = Time indicator, art = Artifact, eve = Event, nat = Natural Phenomenon. Open dataset (free).
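To get a feel for the IOB format mentioned above, you can also peek at the CoNLL 2002 corpus that ships with NLTK (Spanish and Dutch) – a minimal sketch:

import nltk
nltk.download("conll2002")               # fetch the corpus once
from nltk.corpus import conll2002

# each sentence is a list of (word, part-of-speech, IOB entity tag) triples
for sent in conll2002.iob_sents("esp.train")[:2]:
    print(sent)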

CoNLL-2003 Shared Task: https://www.clips.uantwerpen.be/conll2003/ner/ – English and German. Entity types: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

The English data is a collection of news wire articles from the Reuters Corpus. “…because of copyright reasons we only make available the annotations. In order to build the complete data sets you will need access to the Reuters Corpus. It can be obtained for research purposes without any charge from NIST. The German data is a collection of articles from the Frankfurter Rundschau. The named entities have been annotated by people of the University of Antwerp. Only the annotations are available here. In order to build these data sets you need access to the ECI Multilingual Text Corpus. It can be ordered from the Linguistic Data Consortium (2003 non-member price: US$ 35.00).”

AMR Release 1.0: https://catalog.ldc.upenn.edu/LDC2014T12 and AMR Release 2.0: https://catalog.ldc.upenn.edu/LDC2017T10 – English. Data: discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. For LDC members (https://www.ldc.upenn.edu/language-resources/data/obtaining).

AIDA CoNLL-YAGO Dataset (419 KB, 156 MB): https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/downloads/ – for robust disambiguation of named entities. Free (https://creativecommons.org/licenses/by-sa/3.0/deed.en_US).

AIDA-EE Dataset (119 KB): https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/downloads/ – for discovering emerging entities with ambiguous names. Free (https://creativecommons.org/licenses/by-sa/3.0/deed.en_US).

KORE Datasets (5 + 5 KB): https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/downloads/ – hand-crafted difficult sentences with a large number of very ambiguous mentions. Free (https://creativecommons.org/licenses/by-sa/3.0/deed.en_US).

IBEX: Id-Based Entity Extraction: https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/ibex/ – focuses on entities with unique identifiers, e.g. ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses and others. State-of-the-art results for chemical substances, chemical formulas, documents, emails and product data.

YAGO (19 GB compressed): https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/ – persons, organizations, cities, etc. Free.

Microposts Named Entity rEcognition and Linking (NEEL) Challenge: http://microposts2016.seas.upenn.edu/challenge.html – event entities from tweets: over 18 million tweets covering multiple noteworthy events from 2011 and 2013 (including the death of Amy Winehouse, the London Riots, the Oslo bombing and the Westgate Shopping Mall shootout), plus tweets extracted from the Twitter firehose from 2014 and 2015 via a selection of hashtags.

CHEMDNER task of BioCreative IV (18.3 MB): http://mldata.org/repository/tags/data/Named/ – drug name recognition task.

MASC (500K words of written texts and transcribed speech drawn primarily from the Open American National Corpus (OANC)): http://www.anc.org/data/masc/corpus/ – person, location, organization, date and time.

WordNet – a large lexical database of English: https://wordnet.princeton.edu/download – specific persons, countries and geographic entities. May be used in commercial applications in accordance with the license agreement (https://wordnet.princeton.edu/license-and-commercial-use).

2.9. Pre-processed Datasets for Text Classification:

20 Newsgroups Dataset+ Reuters 21578 + Cade12 + WebKB:  http://ana.cachopo.org/datasets-for-single-label-text-categorization
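The 20 Newsgroups collection is also available directly through scikit-learn, which is handy if you just want to experiment with a text classifier right away – a minimal sketch, assuming scikit-learn is installed:

from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
print(len(train.data), "documents")      # number of training documents
print(train.target_names[:5])            # the first few newsgroup categories
print(train.data[0][:200])               # beginning of the first document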

3. Other lists of free/public datasets:

https://github.com/niderhoff/nlp-datasets
https://github.com/karthikncode/nlp-datasets
https://github.com/caesar0301/awesome-public-datasets#natural-language
https://deeplearning4j.org/opendata

If you know of other interesting corpora, you can leave a comment here or use the contact form.
