updated 03.29.20
Dataset Added Language Description Instances Format Task Created Creator Download
1.5 billion Words Arabic Corpus 03.29.20 Arabic The data were collected from newspaper articles in ten major news sources from eight Arabic countries, over a period of fourteen years. 5M XML Text Corpora 2016 El-khair et al.
A Conversational Question Answering Challenge (CoQA) 01.15.20 English Dataset for measuring the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. 127,000+ JSON Question Answering, Reading Comprehension 2019 Reddy et al.
A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning (CLEVR & CoGenT) 01.29.20 English Visual question answering dataset containing 100,000 images and 999,968 questions. 999,968 questions; 100,000 images JSON Question Answering, Visual 2016 Johnson et al.
A Novel Approach to a Semantically-Aware Representation of Items (NASARI) 02.16.20 Multi-Lingual Dataset contains semantic vector representations for BabelNet synsets and Wikipedia pages in several languages: English, Spanish, French, German and Italian. Three vector types are currently available: lexical, unified and embedded. 610K-4.4M depending on language Text Semantic Similarity 2016 Camacho-Collados et al.
A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs (DROP) 01.15.20 English Dataset is used to resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). 96,000 JSON Question Answering, Reading Comprehension 2019 Dua et al.
ABC Australia News Corpus 01.15.20 English Entire news corpus of ABC Australia from 2003 to 2017. 1,103,664 CSV Clustering, Events, Sentiment Analysis 2017 Kulkarni
AG News 02.06.20 English Dataset contains more than 1 million news articles for topic classification. The 4 classes are: World, Sports, Business, and Sci/Tech. 1M+ CSV Classification 2015 Zhang et al.
AI2 Reasoning Challenge (ARC) 01.15.20 English Dataset contains 7,787 genuine grade-school level, multiple-choice science questions. 7,787 JSON, CSV Question Answering, Reading Comprehension 2018 Clark et al.
AI2 Science Questions Mercury 01.15.20 English Dataset consists of questions used in student assessments across elementary and middle school grade levels. Includes questions with diagrams and without. 6,940 JSON, JPG Reading Comprehension 2017 Allen Institute
AI2 Science Questions v2.1 01.15.20 English Dataset consists of questions used in student assessments in the United States across elementary and middle school grade levels. Each question is in 4-way multiple-choice format and may or may not include a diagram element. 5,060 JSON, CSV Question Answering, Reading Comprehension 2017 Allen Institute
AQuA 01.15.20 English Dataset containing algebraic word problems with rationales for their answers. 100,000 JSON Question Answering, Reading Comprehension 2017 Ling et al.
ASTD: Arabic Sentiment Tweets Dataset 03.29.20 Arabic Dataset contains over 10k Arabic sentiment tweets classified into 4 classes: subjective positive, subjective negative, subjective mixed, and objective. 10,000+ Text Classification, Sentiment Analysis 2015 Nabil et al.
ASU Twitter Dataset 01.15.20 English Twitter network data, not actual tweets. Shows connections between a large number of users. 11,316,811 users, 85,331,846 connections CSV Clustering, Graph Analysis 2009 Zafarani et al.
ATIS 02.16.20 English Dataset is a collection of utterances to a flight booking system, accompanied by a relational database and SQL queries to answer the questions. 877 JSON Semantic Parsing, Text-to-SQL 2017 Dahl/Iyer et al.
Abductive Natural Language Inference (aNLI) 01.29.20 English Dataset is a binary-classification task; the goal is to pick the most plausible explanatory hypothesis given two observations from narrative contexts. It contains 20k commonsense narrative contexts and 200k explanations. 20,000 JSON Classification, Commonsense 2019 Bhagavatula et al.
Academic 02.16.20 English Questions about the Microsoft Academic Search (MAS) database, derived by enumerating every logical query that could be expressed using the search page of the MAS website and writing sentences to match them. 196 JSON Semantic Parsing, Text-to-SQL 2014 Li et al.
Activitynet-QA 01.15.20 English Dataset contains 58,000 human-annotated QA pairs on 5,800 videos derived from the popular ActivityNet dataset. The dataset provides a benchmark for testing the performance of VideoQA models on long-term spatio-temporal reasoning. 58,000 JSON Question Answering, Visual, Commonsense 2019 Yu et al.
Advising 02.16.20 English Dataset contains questions regarding course information at the University of Michigan, but with fictional student records. 4,570 JSON Semantic Parsing, Text-to-SQL 2018 Finegan-Dollak et al.
Affective Text 01.21.20 English Classification of emotions in 250 news headlines. Categories: anger, disgust, fear, joy, happiness, sadness, surprise. 250 SGML, Text Emotion Classification 2007 Strapparava et al.
Amazon Fine Food Reviews 01.15.20 English Dataset consists of reviews of fine foods from Amazon. 568,454 CSV Classification, Sentiment Analysis 2013 McAuley et al.
Amazon Reviews 01.15.20 English US product reviews from Amazon. 233.1M JSON Classification, Sentiment Analysis 2018 McAuley et al.
An Open Information Extraction Corpus (OPIEC) 01.15.20 English OPIEC is an Open Information Extraction (OIE) corpus, constructed from the entire English Wikipedia and containing more than 341M triples. 341M AVRO Knowledge Base, Information Extraction 2019 Gashteovski et al.
Arabic Jordanian General Tweets (AJGT) 03.29.20 Arabic Dataset consists of 1,800 tweets annotated as positive and negative. Modern Standard Arabic (MSA) or Jordanian dialect. 1,800 Excel Classification, Sentiment Analysis 2017 Alomari
Arabic Reading Comprehension Dataset (ARCD) 03.05.20 Arabic Dataset contains 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of the Stanford Question Answering Dataset (Arabic-SQuAD) containing 48,344 questions. ~50,000 JSON Question Answering, Reading Comprehension 2019 Mozannar et al.
Arabic Speech Corpus 03.29.20 Arabic Dataset was recorded in south Levantine Arabic (Damascian accent) using a professional studio. Synthesized speech as an output using this corpus has produced a high quality, natural voice. n/a WAV, LAB Speech Corpora 2016 Halabi
Arabic Violence Twitter Corpus 02.16.20 Arabic Annotated Arabic tweets which mention a violent act. Tweets were classified into 8 classes: Crime, Accident, Crisis, Conflict, Human Rights Abuse, Violence, Opinion, or other. Requires using Twitter API to match IDs with tweets for retrieval. 20,000 Text Classification 2016 Ayman et al.
ArabicWeb16 03.29.20 Arabic Dataset contains 150,211,934 Arabic Web pages with high coverage of dialectal Arabic as well as Modern Standard Arabic (MSA). 150M WARC Text Corpora 2016 Suwaileh et al.
Argentinian Spanish [es-ar] Speech Multi-Speaker Dataset 01.21.20 Spanish (Argentinian) Speech dataset containing about 5,900 transcribed high-quality audio recordings of Argentinian Spanish [es-ar] sentences recorded by volunteers. ~5,900 WAV Speech Recognition 2018 Google
ArguAna TripAdvisor Corpus 03.05.20 English Dataset contains 2,100 hotel reviews balanced with respect to the reviews’ sentiment scores. Reviews are segmented into subsentence-level statements that have been manually classified as a fact, a positive, or a negative opinion. 2,100 XMI Classification, Sentiment Analysis 2014 Wachsmuth et al.
Aristo Tuple KB 01.15.20 English Dataset contains a collection of high-precision, domain-targeted (subject,relation,object) tuples extracted from text using a high-precision extraction pipeline, and guided by domain vocabulary constraints. 282,594 TSV Knowledge Base 2017 Dalvi et al.
AudioSet 01.15.20 Multi-Lingual Dataset consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. n/a CSV, TFR Speech Recognition, Visual 2017 Google
Automated Essay Scoring 01.15.20 English Dataset contains student-written essays with scores. n/a TSV, xlsx Scoring Classification 2017 The Hewlett Foundation
Automatic Keyphrase Extraction 01.15.20 English Multiple datasets for automatic keyphrase extraction. n/a Multiple Information Retrieval 1999-2008 Several
BSNLP-2019 03.29.20 Multi-Lingual Dataset used to classify named entities in web documents in Slavic languages, their lemmatization, and cross-language matching. Dataset covers 4 languages: Bulgarian, Czech, Polish, and Russian. n/a Text, OUT Named Entity Recognition (NER), Entity Linking 2019 Piskorski et al.
Background Knowledge Dialogue Dataset 03.05.20 English Dataset containing movie chats wherein each response is explicitly generated by copying and/or modifying sentences from unstructured background knowledge such as plots, comments and reviews about the movie. 90,000 JSON Dialogue 2018 Moghe et al.
Bianet 03.05.20 Multi-Lingual  Dataset is a parallel news corpus with 3,214 Turkish articles with their sentence-aligned Kurdish or English translations from the Bianet online newspaper. Requires a request submission for dataset. 3,214 XML Machine Translation 2018 Ataman et al.
Bible Corpus 03.05.20 Multi-Lingual A parallel corpus created from translations of the Bible containing 102 languages. 2.84M XML Machine Translation 2014 Christodoulopoulos et al.
BlogFeedback Dataset 01.15.20 English Dataset to predict the number of comments a post will receive based on features of that post. 60,021 Text Regression 2014 Buza
Blogger Authorship Corpus 01.15.20 English Blog post entries from 19,320 people. 681,288 Text Classification, Sentiment Analysis 2006 Schler et al.
Books Corpus 02.06.20 Multi-Lingual Dataset contains a collection of copyright free books. Corpus consists of 16 languages and 0.91M sentence fragments and 19.50M tokens. 0.91M XCES, XML Machine Translation 2012 Tiedemann
BoolQ 01.15.20 English Question answering dataset for yes/no questions. 15,942 JSON Binary Question Answering 2019 Clark et al.
Break 02.16.20 English Dataset contains 83,978 examples sampled from 10 question answering datasets over text, images and databases. Dataset used to obtain the Question Decomposition Meaning Representation (QDMR) for questions. 83,978 CSV Natural Question Understanding (NQU) 2020 Wolfson et al.
Buzz in Social Media Dataset 01.15.20 English Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites. 140,000 Text Classification 2013 Kawala et al.
CAPES 03.05.20 English, Portuguese A parallel corpus of theses and dissertation abstracts in Portuguese and English from CAPES. 2.32M XML Machine Translation 2012 Tiedemann et al.
CASS 03.05.20 French Dataset is composed of decisions made by the French Court of cassation and summaries of these decisions made by lawyers. 129,445 XML Summarization 2019 Bouscarrat et al.
CCMatrix 02.16.20 Multi-Lingual 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public dataset. 4.5B to be added soon Machine Translation 2019 Schwenk et al.
CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) 01.15.20 English Dataset contains more than 23,500 sentence utterance videos from more than 1,000 online YouTube speakers. The dataset is gender balanced. All sentence utterances are randomly chosen from various topics and monologue videos. 23,500 n/a Sentiment Analysis, Emotion Recognition, Visual 2018 MultiComp Lab
CNN / Daily Mail Dataset 01.15.20 English Cloze-style reading comprehension dataset created from CNN and Daily Mail news articles. 1M+ Question Question Answering, Reading Comprehension 2015 Hermann et al.
COVID-19 Open Research Dataset (CORD-19) 03.29.20 English Dataset contains 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community. 44,000 JSON Text Corpora 2020 Allen Institute
COmmonsense Dataset Adversarially-authored by Humans (CODAH) 01.15.20 English Commonsense QA in the sentence completion style of SWAG. As opposed to other automatically generated NLI datasets, CODAH is adversarially constructed by humans who can view feedback from a pre-trained model and use this information to design challenging commonsense questions. 2,776 TSV Question Answering, Reading Comprehension, Commonsense 2019 Chen et al.
Car Evaluation Dataset 01.15.20 English Car properties and their overall acceptability. 1,728 Text Classification 1997 Bohanec
Children’s Book Test (CBT) 01.15.20 English Dataset contains ‘questions’ from chapters in the book by enumerating 21 consecutive sentences. In each question, the first 20 sentences form the context, and a word is removed from the 21st sentence, which becomes the query. Models must identify the answer word among a selection of 10 candidate answers appearing in the context sentences and the query. ~688,000 Text Question Answering, Reading Comprehension 2016 Hill et al.
Chinese Machine Reading Comprehension (CMRC 2018) 03.29.20 Chinese Dataset is composed of nearly 20,000 real questions annotated on Wikipedia paragraphs by human experts. 20,000 JSON Question Answering, Reading Comprehension 2018 Cui et al.
Choice of Plausible Alternatives (COPA) 01.15.20 English Dataset used for open-domain commonsense causal reasoning. 1,000 XML Commonsense Reasoning 2011 Roemmele et al.
Classify Emotional Relationships of Fictional Characters 01.21.20 English Dataset contains 19 short stories that are shorter than 1,500 words, and depict at least four different characters. 19 Text Text Corpora, Emotion Classification 2019 Kim et al.
Clinical Case Reports for Machine Reading Comprehension (CliCR) 01.15.20 English Dataset was built from clinical case reports, requiring the reader to answer the query with a medical problem/test/treatment entity. 100,000 JSON Question Answering, Reading Comprehension 2018 Šuster et al.
ClueWeb Corpora 01.15.20 English Annotated web pages from the ClueWeb09 and ClueWeb12 corpora. 340,451,982 Text Classification 2013 Gabrilovich et al.
Coached Conversational Preference Elicitation 01.15.20 English Dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. 12,000 JSON Dialogue 2019 Radlinski et al.
Coarse Discourse 02.16.20 English Dataset contains discourse annotations and relations on threads from Reddit during 2016. Requires merging using Reddit API. 9,473 JSON Text Corpora 2017 Zhang et al.
Code-Mixed-Dialog 03.05.20 Multi-Lingual A goal-oriented dialog dataset containing code-mixed conversations. Specifically, it provides code-mixed versions of the DSTC2 restaurant reservation dataset in Hindi-English, Bengali-English, Gujarati-English and Tamil-English. 49,167 Text Dialogue 2018 Banerjee et al.
CommitmentBank 01.15.20 English Dataset contains naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment canceling operator (question, modal, negation, antecedent of conditional). 1,200 CSV Entailment, Inference 2019 Marneffe et al.
Common Objects in Context (COCO) 01.29.20 English COCO is a large-scale object detection, segmentation, and captioning dataset. Dataset contains 330K images (>200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, and 5 captions per image. 330,000 JSON, JPG Automatic Image Captioning 2014 Lin et al.
Common Voice 01.15.20 Multi-Lingual Dataset containing audio in 29 languages and 2,454 recorded hours. n/a MP3 Speech Recognition 2019 Mozilla
CommonCrawl 01.15.20 Multi-Lingual Dataset contains data from 25 billion web pages. 25B WET Text Corpora 2013-2019 Common Crawl Foundation
CommonGen 03.29.20 English Dataset consists of 30k concept-sets with human-written sentences as references. 30,000 JSON Text Generation 2019 Lin et al.
CommonsenseQA 01.15.20 English Multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers. 12,102 JSON Question Answering, Reading Comprehension, Commonsense 2018 Talmor et al.
Complex Factoid Question Answering with Paraphrase Clusters (ComQA) 02.16.20 English The dataset contains questions with various challenging phenomena such as the need for temporal reasoning, comparison (e.g., comparatives, superlatives, ordinals), compositionality (multiple, possibly nested, subquestions with multiple entities), and unanswerable questions. 11,214 JSON Question Answering, Reading Comprehension 2019 Abujabal et al.
ComplexWebQuestions 01.15.20 English Dataset contains a large set of complex questions in natural language, and can be used in multiple ways. 34,689 JSON Question Answering, Reading Comprehension 2018 Talmor et al.
Conceptual Captions 01.15.20 English Dataset contains ~3.3M images annotated with captions to be used for the task of automatically producing a natural-language description for an image. 3,318,333 TSV Automatic Image Captioning 2018 Sharma et al.
Conference on Computational Natural Language Learning (CoNLL 2002) 02.16.20 Spanish, Dutch Spanish data is a collection of newswire articles made available by the Spanish EFE News Agency. The Dutch data consist of four editions of the Belgian newspaper "De Morgen" of 2000. IOB2 format. n/a HTML Named Entity Recognition (NER) 2002 Tjong et al.
Conference on Computational Natural Language Learning (CoNLL 2003) 02.06.20 English, German Dataset contains news articles whose text is segmented in 4 columns: the first item is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. English 1,393; German 909 Tar Text Corpora, Named Entity Recognition (NER), Part-of-Speech (POS) 2003 Sang et al.
Content-Based Categorized Dataset 03.29.20 Arabic Dataset contains 996 Web pages that were extracted from the ArabicWeb16 dataset and labeled. 996 Text Text Classification 2016 Suwaileh et al.
Conversational Text-to-SQL Systems (CoSQL) 01.15.20 English Dataset consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz collection of 3k dialogues querying 200 complex databases spanning 138 domains. It is the dialogue version of the Spider and SParC tasks. 3,000 JSON, SQL Dialogue, SQL-to-Text 2019 Yu et al.
Cornell Movie--Dialogs Corpus 01.15.20 English This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts. It includes 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies, for a total of 304,713 utterances. 304,713 Text Dialogue 2011 Danescu et al.
Cornell Natural Language for Visual Reasoning (NLVR and NLVR2) 01.29.20 English Dataset contains two language grounding datasets containing natural language sentences grounded in images. The task is to determine whether a sentence is true about a visual input. NLVR2 107,292; NLVR 92,244 JSON Question Answering, Visual 2019 Suhr et al.
Cornell Newsroom 01.15.20 English Dataset contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017. 1.3M JSON Text Corpora, Summarization 2018 Grusky et al.
Corporate Messaging Corpus 01.15.20 English Dataset contains statements classified as information, dialog (replies to users, etc.), or action (messages that ask for votes or ask users to click on links, etc.). 3,118 CSV Classification 2015 Crowdflower
Cosmos QA 01.15.20 English Dataset containing thousands of problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. 35,000 CSV Question Answering, Reading Comprehension, Commonsense 2019 Huang et al.
Curation Corpus 03.29.20 English Dataset is a collection of 40,000 professionally-written summaries of news articles, with links to the articles themselves. 40,000 CSV Text Corpora 2020 Curation Corporation
Customer Interaction Data of German Emails and Online Requests 02.06.20 German Dataset is used to evaluate the task of automatically categorizing German customer requests. The dataset consists of a set of emails and online requests sent to the support center of a multimedia software company. 627 XML Text Corpora 2014 Eichler et al.
DEXTER Dataset 01.15.20 English Task given is to determine, from features given, which articles are about corporate acquisitions. 2,600 Text Classification 2008 Reuters
DOGC 03.05.20 Catalan, Spanish A collection of documents from the official journal of the Catalan Government in Catalan and Spanish. 21.87M XML Text Corpora, Machine Translation 2012 Tiedemann et al.
DSL Corpus Collection (DSLCC) 01.15.20 Multi-Lingual Dataset contains short excerpts of journalistic texts in similar languages and dialects. 294,000 Text Discriminating between similar languages 2017 Tang et al.
DVQA 01.15.20 English Dataset containing data visualizations and natural language questions. 3,487,194 JSON, PNG Question Answering, Visual, Commonsense 2018 Kafle et al.
DailyDialog 01.21.20 English A manually labelled conversations dataset. Categories: no emotion, anger, disgust, fear, happiness, sadness, surprise. 13,118 Text Emotion Classification 2017 Li et al.
Danish-Similarity-Dataset 03.29.20 Danish Dataset consists of 99 word pairs rated by 38 human judges according to their semantic similarity. 99 CSV Semantic Textual Similarity 2019 Schneidermann
Dataset for Fill-in-the-Blank Humor 01.15.20 English Dataset contains 50 fill-in-the-blank stories similar in style to Mad Libs. The blanks in these stories include the original word and the hint type (e.g. animal, food, noun, adverb). 50 JSON Text Generation 2017 Hossain et al.
Dataset for Intent Classification and Out-of-Scope Prediction 01.21.20 English Dataset is a benchmark for evaluating intent classification systems for dialog systems / chatbots in the presence of out-of-scope queries. 23,000+ JSON Intent Classification 2019 Larson et al.
Dataset for the Machine Comprehension of Text 01.15.20 English Stories and associated questions for testing comprehension of text. 660 Text Question Answering, Reading Comprehension 2013 Richardson et al.
Dbpedia 01.15.20 Multi-Lingual The English version of the DBpedia knowledge base currently describes 6.6M entities of which 4.9M have abstracts, 1.9M have geo coordinates and 1.7M depictions. In total, 5.5M resources are classified in a consistent ontology. 6.6M Multiple Knowledge Base 2016 Dbpedia
Deal or No Deal? End-to-End Learning for Negotiation Dialogues 01.15.20 English This dataset consists of 5,808 dialogues, based on 2,236 unique scenarios dealing with negotiations and complex communication. 5,808 Text Dialogue 2017 Lewis et al.
Delta Reading Comprehension Dataset 03.29.20 Chinese Dataset organizes 10,014 paragraphs from 2,108 wiki entries and highlights more than 30,000 questions from the paragraphs. 10,014 JSON Question Answering, Reading Comprehension 2019 Shao et al.
Densely Annotated Wikipedia Texts (DAWT) 02.16.20 Multi-Lingual Dataset contains a total of 13.6M articles across several languages: English, Spanish, Italian, German, French and Arabic. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of entity. 13.6M JSON Named Entity Recognition (NER) 2017 Spasojevic et al.
Dialogue Natural Language Inference (NLI) 01.29.20 English Dataset used to improve the consistency of a dialogue model. It consists of sentence pairs labeled as entailment (E), neutral (N), or contradiction (C). 340,000+ JSON Dialogue, Entailment 2019 Welleck et al.
DiscoFuse 01.21.20 English Dataset contains examples for training sentence fusion models. Sentence fusion is the task of joining several independent sentences into a single coherent text. The data has been collected from Wikipedia and from Sports articles. ~60M TSV Sentence Fusion 2019 Geva et al.
DuReader 01.15.20 Mandarin DuReader version 2.0 contains more than 300K questions, 1.4M evidence documents and 660K human-generated answers. 1,431,429 JSON Question Answering, Reading Comprehension 2018 He et al.
Dutch Book Reviews 01.21.20 Dutch Dataset contains book reviews along with associated binary sentiment polarity labels. 118,516 Text Classification, Sentiment Analysis 2019 van der Burgh
ECB Corpus 03.05.20 Multi-Lingual Website and documentation from the European Central Bank. Contains 19 languages. 30.55M XML Text Corpora, Machine Translation 2012 Tiedemann et al.
EMEA 03.05.20 Multi-Lingual A parallel corpus made out of PDF documents from the European Medicines Agency. Contains 22 languages. 26.51M XML Machine Translation 2012 Tiedemann et al.
EmoBank 01.29.20 English Dataset is a large-scale text corpus manually annotated with emotion according to the psychological Valence-Arousal-Dominance scheme. 10,000 CSV Text Corpora 2017 Buechel et al.
Emotion-Stimulus 01.21.20 English Dataset annotated with both the emotion and the stimulus using FrameNet’s emotions-directed frame. 820 sentences with both cause and emotion and 1594 sentences marked with their emotion tag. Categories: happiness, sadness, anger, fear, surprise, disgust and shame. 2,414 XML Emotion Classification 2015 Ghazi et al.
EmpatheticDialogues 01.29.20 English Dataset of 25k conversations grounded in emotional situations. 25,000 CSV Dialogue 2019 Rashkin et al.
Enron Email Dataset 01.15.20 English Emails from employees at Enron organized into folders. ~500,000 Text Text Corpora 2004 (2015) Klimt et al.
Eubookshop 03.05.20 Multi-Lingual Corpus of documents from the EU bookshop. Contains 48 languages. 173.20M XML Text Corpora, Machine Translation 2012 Tiedemann et al.
Europarl-ST 03.05.20 Multi-Lingual Dataset contains paired audio-text samples for speech translation, constructed using the debates carried out in the European Parliament in the period between 2008 and 2012. Contains 6 European languages: German, English, Spanish, French, Italian and Portuguese. n/a n/a Speech Translation 2020 Iranzo-Sánchez et al.
European Parliament Proceedings (Europarl) 01.15.20 Multi-Lingual The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages. 10M+ XML Text Corpora, Machine Translation 2002 Koehn et al.
Europeana Newspapers 02.16.20 Multi-Lingual Named Entity Recognition corpora for Dutch, French, German languages from Europeana Newspapers. Data is encoded in the IOB format. 486,218 BIO Named Entity Recognition (NER) 2016 Neudecker
Event-focused Emotion Corpora for German and English 01.21.20 English, German German and English emotion corpora for emotion classification, annotated with crowdsourcing in the style of the ISEAR resources. 2,002 TSV Text Corpora, Emotion Classification 2019 Troiano et al.
Event2Mind 01.21.20 English Dataset contains 25,000 events and free-form descriptions of their intents and reactions. 25,000 CSV Commonsense Inference 2018 Rashkin et al.
Examiner Pseudo-News Corpus 01.15.20 English Clickbait, spam, crowd-sourced headlines from 2010 to 2015. 3,089,781 CSV Clustering, Events, Sentiment Analysis 2017 Kulkarni
Excitement Datasets 02.06.20 English, Italian Datasets contain negative feedback from customers in which they state reasons for dissatisfaction with a given company. The datasets are available in English and Italian. n/a XML Classification, Sentiment Analysis 2015 Kotlerman et al.
Explain Like I’m Five (ELI5) 03.05.20 English The dataset contains 270K threads of open-ended questions that require multi-sentence answers. It was extracted from the subreddit titled “Explain Like I’m Five” (ELI5), in which an online community answers questions with responses that 5-year-olds can comprehend. Facebook scripts allow you to preprocess data. 270,000 Text Question Answering, Reading Comprehension 2019 Fan et al.
Explanations for Science Questions 01.15.20 English Data contains: gold explanation sentences supporting 363 science questions, relation annotation for a subset of those explanations, and a graphical annotation tool with annotation guidelines. 1,363 CSV Question Answering, Reading Comprehension 2016 Jansen et al.
FQuAD 03.05.20 French Dataset contains 25,000+ questions on a set of Wikipedia articles, modeled after SQuAD. 25,000+ JSON Question Answering, Reading Comprehension 2020 d’Hoffschmidt et al.
Fact-based Visual Question Answering (FVQA) 01.29.20 English Dataset contains image question answering triples. 5,826 questions; 2,190 images JSON Question Answering, Visual 2017 Wang et al.
Finlex 03.05.20 Finnish, Swedish Dataset is a collection of legislative and other judicial information of Finland, which is available in Finnish and Swedish. 7.98M XML Text Corpora, Machine Translation 2012 Tiedemann et al.
Finnish News Corpus for Named Entity Recognition 03.29.20 Finnish Dataset contains 953 articles (193,742 word tokens) with 6 named entity classes: organization, location, person, product, event, and date. 953 CSV Named Entity Recognition (NER) 2018 Güngör & Sohrab et al.
Fiskmö 03.05.20 Finnish, Swedish Dataset is a parallel corpus of Finnish and Swedish Languages. 4.24M XML Machine Translation 2012 Tiedemann et al.
GAP Coreference Dataset 02.16.20 English Dataset contains 8,908 gender-balanced coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia. 8,908 TSV Coreference Resolution 2018 Webster et al.
GQA 01.15.20 English Question answering on image scene graphs. 22M JSON, H5 Question Answering, Visual, Commonsense 2019 Hudson et al.
GeoQuery 02.16.20 English Dataset contains utterances issued to a database of US geographical facts. 877 JSON Semantic Parsing, Text-to-SQL 2017 Zelle & Iyer et al.
GermEval 2014 NER Shared Task 02.06.20 German The data was sampled from German Wikipedia and News Corpora as a collection of citations. The dataset covers over 31,000 sentences corresponding to over 590,000 tokens. 31,000+ TSV Named Entity Recognition 2014 Benikova et al.
Global Voices Parallel Corpus 02.06.20 Multi-Lingual Dataset contains news articles from the web site Global Voices in multiple languages. n/a Text Machine Translation 2015 CASMACAT
Google Books N-grams 01.15.20 Multi-Lingual N-grams from a very large corpus of books. 2.2 TB of text Text Classification, Clustering 2011 Google
Groningen Meaning Bank 02.06.20 English Dataset contains texts in raw and tokenised format, tags for part of speech, named entities and lexical categories, and discourse representation structures compatible with first-order logic. 10,000 XML Text Corpora 2014 University of Groningen
Gutenberg Book Corpus 01.15.20 Multi-Lingual Dataset contains 60,000 eBooks. 60,000 Text Text Corpora 1996-2019 Gutenberg
Hansards Canadian Parliament 01.15.20 English Dataset contains pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament. 1.3M Text Text Corpora 2001 Natural Language Group - USC
Harvard Library 01.15.20 English Dataset contains books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials. 12.7M MODS, Dublin Core Text Corpora n/a Harvard
Hate Speech Identification Dataset 01.15.20 English Dataset contains lexicons, notebooks containing content that is racist, sexist, homophobic, and offensive in general. n/a CSV Classification 2017 Davidson et al.
HellaSwag 01.29.20 English Dataset for studying grounded commonsense inference. It consists of 70k multiple choice questions about grounded situations: each question comes from one of two domains, ActivityNet or WikiHow, with four answer choices about what might happen next in the scene. 70,000 JSON Commonsense Reasoning 2019 Zellers et al.
Historical Newspapers Daily Word Time Series Dataset 01.15.20 English Dataset contains daily contents of newspapers published in the US and UK from 1836 to 1922. 25,000 n/a Text Corpora 2017 Dzogang et al.
Home Depot Product Search Relevance 01.15.20 English Dataset contains a number of products and real customer search terms from Home Depot's website. n/a CSV Classification 2015 Home Depot
HotpotQA 01.15.20 English Dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. 1.25M JSON Question Answering, Reading Comprehension 2018 Yang et al.
How2 03.05.20 English, Portuguese Dataset of instructional videos covering a wide variety of topics across video clips (about 2,000 hours), with word-level time alignments to the ground-truth English subtitles. About 300 hours were also translated into Portuguese subtitles. ~2,000 Hours n/a Speech-to-Text, Translation, Summarization, Visual 2018 Sanabria et al.
Human-in-the-loop Dialogue Simulator (HITL) 01.15.20 English Dataset provides a framework for evaluating a bot’s ability to learn to improve its performance in an online setting using feedback from its dialog partner. The dataset contains questions based on the bAbI and WikiMovies datasets, with the addition of feedback from the dialog partner. n/a Text Question Answering, Reading Comprehension 2016 Li et al.
IIT Bombay English-Hindi Corpus 01.21.20 English, Hindi Dataset contains parallel corpus for English-Hindi as well as monolingual Hindi corpus collected from a variety of existing sources. 1.49M n/a Machine Translation 2018 Kunchukuttan et al.
IWSLT 15 English-Vietnamese 01.15.20 Multi-Lingual Sentence pairs for translation. 133,000 Text Machine Translation 2015 Stanford
IWSLT'15 English-Vietnamese 02.06.20 Multi-Lingual Parallel corpus used for English-Vietnamese machine translation. ~130,000 Text Machine Translation 2015 Hong et al.
Indic Languages Multilingual Parallel Corpus 02.16.20 Indian Dataset contains several languages: Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese, Urdu and English. The corpus has been collected from OPUS and belongs to the spoken language (OpenSubtitles) domain. n/a Tar Machine Translation 2018 NICT & Kyoto Univ.
InsuranceQA 01.29.20 English Dataset contains questions and answers collected from the website Insurance Library. It consists of questions from real-world users; the high-quality answers were composed by professionals with deep domain knowledge. There are 16,889 questions in total. 16,889 n/a Question Answering, Reading Comprehension 2015 Feng et al.
Irony Sarcasm Analysis Corpus 01.29.20 English Dataset contains tweets in 4 subgroups: irony, sarcasm, regular and figurative. Requires using Twitter API in order to obtain tweets. 33,000 TSV Classification, Sentiment Analysis 2016 Ling et al.
Jeopardy Questions Answers 01.15.20 English Dataset contains Jeopardy questions, answers and other data. 216,930 JSON Question Answering, Reading Comprehension 2014 Anonymous
Kensho Derived Wikimedia Dataset (KDWD) 02.06.20 English Dataset contains two main components - a link annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base. n/a CSV, JSON Text Corpora, Knowledge Base 2020 Kensho R&D
Khaleej-2004 Corpus 03.29.20 Arabic Dataset contains more than 5,000 articles which correspond to nearly 3 million words across 4 topics: International News, Local News, Economy, and Sports. 5,690 HTML Text Corpora 2004 Abbas et al.
KorQuAD 03.05.20 Korean Dataset containing a total of 100,000+ question answer pairs. 102,960 JSON Question Answering, Reading Comprehension 2019 Lim et al.
LC-QuAD 2.0 03.05.20 English Dataset contains questions and SPARQL queries. LC-QuAD uses DBpedia v04.16 as the target KB. 30,000 JSON Question Answering, Knowledge Graph 2017 Dubey et al.
Language Modeling Broadened to Account for Discourse Aspects (LAMBADA) 02.06.20 English Dataset contains narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. 10,022 Text Natural Language Understanding, Language Modeling 2016 Paperno et al.
Large Movie Review Dataset - Imdb 02.06.20 English Dataset contains 25,000 highly polar movie reviews for training, and 25,000 for testing. 50,000 Text Classification, Sentiment Analysis 2011 Maas et al.
Legal Case Reports 01.15.20 English Federal Court of Australia cases from 2006 to 2009. 4,000 Text Classification 2012 Galgani et al.
LibriSpeech ASR 01.15.20 English Large-scale (1000 hours) corpus of read English speech. n/a FLAC Speech Recognition 2015 OpenSLR
LibriVoxDeEn 03.05.20 German, English Dataset contains sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences. 50,000+ Text, TSV Speech Translation, Machine Translation 2019 Beilharz et al.
Ling-Spam Dataset 01.15.20 English Corpus contains both legitimate and spam emails. n/a Text Classification 2000 Androutsopoulos et al.
LitBank 02.06.20 English Dataset contains 100 works of English-language fiction. It currently contains annotations for entities, events and entity coreference in a sample of ~2,000 words from each of those texts, totaling 210,532 tokens. 100 TSV, Text Named Entity Recognition 2019 Bamman et al.
MSParS 01.15.20 English Dataset for the open domain semantic parsing task. 81,826 Satori Semantic Parsing 2019 Microsoft
Meta-Learning Wizard-of-Oz (MetaLWOz) 01.15.20 English Dataset designed to help develop models capable of predicting user responses in unseen domains. It was created by crowdsourcing 37,884 goal-oriented dialogs, covering 227 tasks in 47 domains. 37,884 Text Dialogue 2019 Microsoft
Microsoft Information-Seeking Conversation (MISC) dataset 01.15.20 English Dataset contains recordings of information-seeking conversations between human “seekers” and “intermediaries”. It includes audio and video signals; transcripts of conversation; affectual and physiological signals; recordings of search and other computer use; and post-task surveys on emotion, success, and effort. n/a various Speech Recognition, Dialogue, Visual 2018 Microsoft
Microsoft Machine Reading COmprehension Dataset (MS MARCO) 01.15.20 English Dataset focused on machine reading comprehension, question answering, passage ranking, keyphrase extraction, and conversational search studies. 1,010,916 JSON Question Answering, Reading Comprehension 2016 Bajaj et al.
Microsoft Research Paraphrase Corpus 01.15.20 English Dataset contains pairs of sentences which have been extracted from news sources on the web, along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship. 5,800 Text Paraphrasing 2005 Dolan et al.
Microsoft Research Social Media Conversation Corpus 01.15.20 English A-B-A triples extracted from Twitter. 4,232 Text Graph Analysis 2016 Sordoni et al.
Microsoft Speech Corpus 01.15.20 Indian Dataset contains conversational and phrasal speech training and test data for Telugu, Tamil and Gujarati languages. n/a Wav Speech Recognition 2019 Microsoft
Microsoft Speech Language Translation Corpus (MSLT) 01.15.20 Multi-Lingual Dataset contains conversational, bilingual speech test and tuning data for English, Chinese, and Japanese. It includes audio data, transcripts, and translations; and allows end-to-end testing of spoken language translation systems on real-world data. n/a Wav Speech Recognition, Machine Translation 2017 Federmann et al.
MovieLens 01.15.20 English Dataset contains 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users. ~22M Text Clustering, Classification, Regression 2016 Harper et al.
MovieTweetings 01.15.20 English Movie rating dataset based on public and well-structured tweets. 822,784 Text Classification, Regression 2018 Dooms
MuST-C 03.05.20 Multi-Lingual Dataset is a speech translation corpus containing 385 hours from Ted talks for speech translation from English into several languages: Dutch, French, German, Italian, Portuguese, Romanian, Russian, & Spanish. Requires filling request form. 385 Hours n/a Speech Translation 2019 Di Gangi et al.
Multi-Domain Wizard-of-Oz Dataset (MultiWoz) 01.15.20 English Dataset of human-human written conversations spanning over multiple domains and topics. The dataset was collected based on the Wizard of Oz experiment on Amazon MTurk. 10,438 JSON Dialogue 2018 Budzianowski et al.
Multi30k 03.29.20 German, English Dataset of images paired with sentences in English and German. This dataset extends the Flickr30K dataset. 31,014 n/a Machine Translation, Multi-Modal Learning 2016 Elliott et al.
MultiLing Pilot 2011 Dataset 03.29.20 Multi-Lingual Dataset is derived from publicly available WikiNews English texts and translated into 7 languages: Arabic, Czech, English, French, Greek, Hebrew, Hindi. n/a Text Summarization 2011 Giannakopoulos et al.
MultiLingual Question Answering (MLQA) 02.06.20 Multi-Lingual Dataset for evaluating cross-lingual question answering performance. ~12K QA instances in English and 5K in each other language in SQuAD format in seven languages - English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. 46,444 JSON Question Answering, Reading Comprehension 2019 Lewis et al.
MultiNLI Matched/Mismatched 01.15.20 English Dataset contains sentence pairs annotated with textual entailment information. 433,000 JSON, Text Entailment 2017 Williams et al.
Multilingual Corpus of Sentence-Aligned Spoken Utterances (MaSS) 03.29.20 Multi-Lingual Dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). Languages: Basque, English, Finnish, French, Hungarian, Romanian, Russian, Spanish. 8,130 n/a Speech Corpora 2020 Boito et al.
Multimodal Comprehension of Cooking Recipes (RecipeQA) 01.15.20 English Dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. 20,000 JSON Question Answering, Reading Comprehension 2018 Yagcioglu et al.
MutualFriends 01.15.20 English Task where two agents must discover which friend of theirs is mutual based on the friend's attributes. n/a JSON Dialogue 2017 He et al.
NLP Chinese Corpus 01.15.20 Chinese Large text corpora in Chinese. 10M+ JSON Text Corpora 2019 Xu et al.
NPS Chat Corpus 01.15.20 English Posts from age-specific online chat rooms. ~500,000 XML Dialogue 2007 Forsyth et al.
NUS SMS Corpus 01.15.20 Mandarin, English SMS messages collected between 2 users, with timing analysis. 67,093 XML Dialogue 2013 Kan et al.
NYSK Dataset 01.15.20 English English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. 10,421 XML Sentiment Analysis, Topic Extraction 2013 Dermouche et al.
Named Entity Model for German, Politics (NEMGP) 02.16.20 German Dataset contains texts from Wikipedia and WikiNews, manually annotated with named entity information. 5,094 Text Named Entity Recognition (NER) 2013 Zastrow
NarrativeQA 01.15.20 English Dataset contains the list of documents with Wikipedia summaries, links to full stories, and questions and answers. 1,572 CSV Question Answering, Reading Comprehension 2017 Kočiský et al.
Natural Questions (NQ) 01.15.20 English Dataset contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. 320,000+ HTML Question Answering, Reading Comprehension 2019 Kwiatkowski et al.
Neutralizing Biased Text 03.29.20 English A parallel corpus of 180,000+ sentence pairs where one sentence is biased and the other is neutralized. The data were obtained from debiasing Wikipedia edits. 180,000 n/a Biased Text Neutralization 2019 Pryzant et al.
News Headlines Dataset for Sarcasm Detection 01.15.20 English High quality dataset with Sarcastic and Non-sarcastic news headlines. 26,709 JSON Clustering, Events, Language Detection 2018 Misra
News Headlines Of India 01.15.20 English Dataset contains an archive of notable events in India during 2001-2018, recorded by the Times of India. 2,969,922 CSV Text Corpora 2017 Kaggle
NewsQA 01.15.20 English Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN. 12,744 JSON, CSV Question Answering, Reading Comprehension 2017 Trischler et al.
One Week of Global News Feeds 01.15.20 Multi-Lingual Dataset contains most of the new news content published online over one week in 2017 and 2018. 3.3M CSV Text Corpora 2018 Kulkarni et al.
OneCommon 01.29.20 English Dataset contains 6,760 dialogues. 6,760 JSON Dialogue 2019 Udagawa et al.
OneSeC Small 03.29.20 Multi-Lingual Automatically-generated corpora in multiple languages with sense annotations for nouns using WordNet for English and BabelNet for all other languages as inventories of senses. 1M+ XML Word Sense Disambiguation 2019 Scarlini et al.
OntoNotes 5.0 01.21.20 Multi-Lingual Dataset contains various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference). n/a Text, SQL Information Retrieval, Syntactic Parsing 2013 Weischedel et al.
Open Images V6 03.05.20 English Dataset containing millions of images that have been annotated with image-level labels and object bounding boxes. 9,178,275 TSV, CSV Automatic Image Captioning 2018 Kuznetsova et al.
Open Research Corpus 01.15.20 English Dataset contains over 39 million published research papers in Computer Science, Neuroscience, and Biomedical. 39M JSON Text Corpora 2018 Ammar et al.
OpenBookQA 01.15.20 English Dataset modeled after open book exams for assessing human understanding of a subject. It consists of 5,957 multiple-choice elementary-level science questions (4,957 train, 500 dev, 500 test), which probe the understanding of a small "book" of 1,326 core science facts and the application of these facts to novel situations. 5,957 JSON Question Answering, Reading Comprehension 2018 Mihaylov et al.
OpenSubtitles 01.29.20 Multi-Lingual Dataset of multi-lingual dialogs from movie scripts. Includes 62 languages. n/a XML, XCES Dialogue 2016 Tiedemann et al.
OpenWebTextCorpus 01.15.20 English Dataset contains the text of millions of webpages sourced from Reddit URLs, totalling 38 GB of text data. 8,013,769 n/a Text Corpora 2019 Radford et al.
Open Super-Large Crawled Almanach Corpus (OSCAR) 01.29.20 Multi-Lingual Multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. 166 different languages available. n/a Text Text Corpora 2019 Suárez et al.
OpinRank Review Dataset 01.15.20 English Reviews of cars and hotels from Edmunds and TripAdvisor. Edmunds: 42,230, TripAdvisor: 259,000 Text Information Retrieval, Entity Ranking, Entity Retrieval 2011 Ganesan et al.
PG-19 02.16.20 English Dataset contains a set of books extracted from the Project Gutenberg books library that were published before 1919. It also contains metadata of book titles and publication dates. 28,752 Text Text Corpora, Language Modeling 2019 Rae et al.
ParCorFull 03.29.20 German, English A parallel corpus annotated for the task of translating coreference across languages. 14,927 XML Machine Translation, Coreference Resolution 2018 Lapshinova-Koltunski et al.
Parallel Arabic DIalectal Corpus (PADIC) 03.29.20 Arabic Dataset is a multi-dialectal corpus - contains six dialects in addition to MSA in Buckwalter format. 6,000+ HTML Text Corpora 2013 Abbas et al.
Parallel Meaning Bank 02.06.20 Multi-Lingual Dataset contains sentences and texts in raw and tokenised format, syntactic analysis, word senses, thematic roles, reference resolution, and formal meaning representations. The annotated parallel corpus includes English, German, Dutch and Italian languages. 8,705 XML Text Corpora 2017 University of Groningen
Paraphrase Adversaries from Word Scrambling (PAWS) 01.21.20 English Dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. 750,000+ TSV Paraphrasing Identification 2019 Zhang et al.
Paraphrase Adversaries from Word Scrambling (PAWS-X) 01.21.20 Multi-Lingual Dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki. 300,000+ TSV Paraphrasing Identification 2019 Yang et al.
Paraphrase and Semantic Similarity in Twitter (PIT) 01.15.20 English Dataset focuses on whether tweets have (almost) same meaning/information or not. 18,762 Text Classification 2015 Xu et al.
Personae Corpus 01.15.20 Dutch Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays. 145 Text Classification, Regression 2008 Luyckx et al.
Personalized Dialog 01.15.20 English Dataset of dialogs from movie scripts. 12,000 Text Dialogue 2017 Joshi et al.
Physical IQA 01.29.20 English Dataset is used for commonsense QA benchmark for naive physics reasoning focusing on how we interact with everyday objects in everyday situations. The dataset includes 20,000 QA pairs that are either multiple-choice or true/false questions. 20,000 JSON Question Answering, Commonsense 2019 Bisk et al.
Plaintext Jokes 01.15.20 English 208,000 jokes in this database scraped from three sources. 208,000 JSON Text Corpora 2016 Pungas et al.
Portuguese Newswire Corpus 01.21.20 Portuguese (Brazil) Dataset contains newswire articles collected between 1994 and 2016. Requires preprocessing of the HTML pages found on GitHub via the download link. n/a HTML Text Corpora 2016 Boğaziçi University
Portuguese SQuAD v1.1 01.21.20 Portuguese Portuguese translation of the SQuAD dataset. The translation was performed using the Google Cloud API. ~100,000 JSON Question Answering, Reading Comprehension 2019 Carvalho et al.
ProPara Dataset 01.15.20 English Dataset is used for comprehension of simple paragraphs describing processes, e.g., photosynthesis. The comprehension task relies on predicting, tracking, and answering questions about how entities change during the process. 488 Google Sheets Question Answering, Reading Comprehension 2018 Mishra et al.
QA-SRL Bank 01.29.20 English Dataset contains question-answer pairs for 64,000 sentences. Dataset is used to train models for semantic role labeling. 64,000 JSON Question Answering, Semantic Role Labeling 2018 FitzGerald et al.
QA-ZRE 01.29.20 English Dataset contains question-answer pairs, with each instance containing a relation, a question, a sentence, and an answer set. 30M Text Question Answering, Relation Extraction 2017 Levy et al.
QASC 02.06.20 English QASC is a question-answering dataset with a focus on sentence composition. It consists of 9,980 8-way multiple-choice questions about grade school science (8,134 train, 926 dev, 920 test), and comes with a corpus of 17M sentences. 9,980 JSON Question Answering, Reading Comprehension 2020 Khot et al.
QuaRTz Dataset 01.15.20 English Dataset contains 3,864 questions about open domain qualitative relationships. Each question is paired with one of 405 different background sentences (sometimes short paragraphs). 3,864 JSON Question Answering, Reading Comprehension 2019 Tafjord et al.
QuaRel Dataset 01.15.20 English Dataset contains 2,771 story questions about qualitative relationships. 2,771 JSON Question Answering, Reading Comprehension 2018 Tafjord et al.
Quasar-S & T 01.15.20 English The Quasar-S dataset consists of 37,000 cloze-style queries constructed from definitions of software entity tags on the popular website Stack Overflow. The Quasar-T dataset consists of 43,000 open-domain trivia questions and their answers obtained from various internet sources. 80,000 JSON Question Answering, Reading Comprehension 2017 Dhingra et al.
Question Answering in Context (QuAC) 01.15.20 English Dataset for modeling, understanding, and participating in information seeking dialog. 14,000 JSON Question Answering, Reading Comprehension 2018 Choi et al.
Question NLI 01.15.20 English Dataset converts SQuAD dataset into sentence pair classification by forming a pair between each question and each sentence in the corresponding context. 110,000 JSON Inference 2018 Rajpurkar et al.
Quora Question Pairs 01.15.20 English The task is to determine whether a pair of questions are semantically equivalent. 400,000 TSV Semantic Similarity 2017 Quora
Quoref 02.06.20 English Dataset which tests the coreferential reasoning capability of reading comprehension systems. In this span-selection benchmark containing 24K questions over 4.7K paragraphs from Wikipedia, a system must resolve hard coreferences before selecting the appropriate span(s) in the paragraphs for answering questions. 24,000 JSON Question Answering, Reading Comprehension 2019 Dasigi et al.
ReAding Comprehension Dataset From Examinations (RACE) 01.15.20 English Dataset was collected from the English exams evaluating the students' ability in understanding and reasoning. 28,000 JSON Question Answering, Reading Comprehension 2017 Lai et al.
ReVerb45k, Base and Ambiguous 01.29.20 English 3 Datasets. In total, there are 91K triples. 91,000 JSON Information Retrieval, Knowledge Base 2018 Vashishth et al.
Reading Comprehension over Multiple Sentences (MultiRC) 01.15.20 English Dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph. ~10,000 JSON Question Answering, Reading Comprehension 2018 Khashabi et al.
Reading Comprehension with Commonsense Reasoning Dataset (Record) 01.15.20 English Reading comprehension dataset which requires commonsense reasoning. Contains 120,000+ queries from 70,000+ news articles. 70,000+ JSON Question Answering, Reading Comprehension 2018 Zhang et al.
Reading Comprehension with Multiple Hops (Qangaroo) 01.15.20 English Reading comprehension datasets focussing on multi-hop (alias multi-step) inference. There are 2 datasets: WikiHop (based on Wikipedia) and MedHop (based on PubMed research papers). ~53,000 JSON Question Answering, Reading Comprehension 2018 Welbl et al.
Recognizing Textual Entailment (RTE) 01.15.20 English Datasets are combined and converted to two-class classification: entailment and not_entailment. n/a JSON Entailment 2006-2009 Dagan et al, Bar Haim et al, Giampiccolo, and Bentivogli et al.
Reddit All Comments Corpus 01.15.20 English All Reddit comments (as of 2017). 3,329,219,008 JSON Text Corpora 2017 Reddit
Relation Extraction Corpus 01.21.20 English A human-judged dataset of two relations involving public figures on Wikipedia: about 10,000 examples of "place of birth" and 40,000 examples of "attended or graduated from an institution." 10,000 JSON Relation Extraction 2013 Google
Relationship and Entity Extraction Evaluation Dataset (RE3D) 01.15.20 English Entity and Relation marked data from various news and government sources. n/a JSON Classification, Entity and Relation Recognition 2017 Dstl
Restaurants 02.16.20 English Dataset contains user questions about restaurants, their food types, and locations. 378 JSON Semantic Parsing, Text-to-SQL 2012 Tang/Popescu/
Reuters News Wire Headline 01.15.20 English Dataset contains 11 years of timestamped events published on the news-wire. 16,121,310 TSV Clustering, Events, Language Detection 2018 Kulkarni
SMS Spam Collection Dataset 01.15.20 English Dataset contains SMS spam messages. 5,574 Text Classification 2011 Almeida et al.
SNAP Social Circles: Twitter Database 01.15.20 English Large Twitter network data. Nodes: 81,306, Edges:1,768,149 Text Clustering, Graph Analysis 2012 McAuley et al.
SQuAD v2.0 01.15.20 English Paragraphs w/ questions and answers. 150,000 JSON Question Answering, Reading Comprehension 2018 Rajpurkar et al.
SQuAD-it 03.05.20 Italian The dataset contains more than 60,000 question/answer pairs in Italian derived from the original English SQuAD dataset. 60,000+ JSON Question Answering, Reading Comprehension 2018 Croce et al.
Saudi Newspapers Corpus 01.15.20 Arabic Dataset contains 31,030 Arabic newspaper articles. 31,030 JSON Text Corpora 2015 Alhagri
SberQuAD 03.05.20 Russian Dataset consists of question-answer pairs modeled after SQuAD. 50,364 CSV Question Answering, Reading Comprehension 2019 Efimov et al.
Schema-Guided Dialogue State Tracking (DSTC 8) 01.15.20 English Dataset contains 18K dialogues between a virtual assistant and a user. ~18,000 JSON Dialogue State Tracking 2019 Rastogi et al.
Scholar 02.16.20 English User questions about academic publications, with automatically generated SQL that was checked by asking the user if the output was correct. 817 JSON Semantic Parsing, Text-to-SQL 2017 Iyer et al.
SciQ Dataset 01.15.20 English Dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. 13,679 JSON Question Answering, Reading Comprehension 2017 Welbl et al.
SciTail Dataset 01.15.20 English Dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis. 27,026 SNLI, TSV, DGEM Entailment 2018 Khot et al.
SearchQA 01.15.20 English Dataset from Jeopardy archives which consists of more than 140k question-answer pairs, with each pair having 49.6 snippets on average. 140,000 JSON Question Answering, Reading Comprehension 2017 Dunn et al.
SemEval-2016 Task 4 03.29.20 English Dataset contains 5 subtasks involving the sentiment analysis of tweets. ~75,000 Text Classification, Sentiment Analysis 2016 Nakov et al.
SemEval-2019 Task 9 - Subtask A 02.06.20 English Suggestion Mining from Online Reviews and Forums: Dataset contains corpora of unstructured text with the intent for mining it for suggestions. ~6,300 CSV Suggestion Mining 2019 Negi et al.
SemEval-2019 Task 9 - Subtask B 02.06.20 English Suggestion Mining from Hotel Reviews: Dataset contains corpora of unstructured text with the intent for mining it for suggestions. ~800 CSV Suggestion Mining 2019 Negi et al.
SemEvalCQA 01.15.20 Arabic, English Dataset for community question answering. n/a XML Question Answering, Reading Comprehension 2016 Nakov et al.
Semantic Parsing in Context (SParC) 01.15.20 English Dataset consists of 4,298 coherent question sequences (12k+ unique individual questions annotated with SQL queries). It is the context-dependent/multi-turn version of the Spider task. 4,298 JSON, SQL Semantic Parsing, SQL-to-Text 2019 Yu et al.
Semantic Textual Similarity Benchmark 01.15.20 English The task is to predict textual similarity between sentence pairs. 8,628 CSV Semantic Similarity 2017 Cer et al.
Sentences Involving Compositional Knowledge (SICK) 02.06.20 English Dataset contains sentence pairs, generated from two existing sets: the 8K ImageFlickr data set and the SemEval 2012 STS MSR-Video Description. ~10,000 Text Semantic Similarity, Entailment 2014 Marelli et al.
Sentiment Corpus of App Reviews with Fine-grained Annotations in German (SCARE) 01.29.20 German Dataset consists of fine-grained annotations for mobile application reviews from the Google Play Store. For each user review the mentioned application aspects, i.e., the design or the usability, as well as subjective phrases, which evaluate these aspects, are annotated. In addition, the polarity (positive, negative or neutral) of each subjective phrase is recorded as well as the relationship of an aspect to the main app in discussion. Requires emailing source for password to retrieve data. 800,000 CSV Sentiment Analysis 2016 Sänger et al.
Sentiment Labeled Sentences Dataset 01.15.20 English Dataset contains 3000 sentiment labeled sentences. 3,000 Text Classification, Sentiment Analysis 2015 Kotzias
Sentiment140 01.15.20 English Tweet data from 2009 including original text, time stamp, user and sentiment. 1,578,627 CSV Sentiment Analysis 2009 Go et al.
Shaping Answers with Rules through Conversation (ShARC) 01.15.20 English ShARC is a Conversational Question Answering dataset focussing on question answering from texts containing rules. 32,000 JSON Question Answering, Reading Comprehension 2018 Saeidi et al.
Short Answer Scoring 01.15.20 English Student-written short-answer responses. n/a TSV Scoring Classification 2012 The Hewlett Foundation
Simplified Versions of the CommAI Navigation tasks (SCAN) 01.29.20 English Dataset used for studying compositional learning and zero-shot generalization. SCAN consists of a set of commands and their corresponding action sequences. 20,000+ Text Compositional Learning 2018 Lake et al.
Situations With Adversarial Generations (SWAG) 01.15.20 English Dataset consists of 113k multiple choice questions about grounded situations. Each question is a video caption from LSMDC or ActivityNet Captions, with four answer choices about what might happen next in the scene. 113,000 CSV Question Answering, Reading Comprehension 2018 Zellers et al.
Skytrax User Reviews Dataset 01.15.20 English User reviews of airlines, airports, seats, and lounges from Skytrax. 41,396 CSV Classification, Sentiment Analysis 2015 Nguyen
Soccer Dialogues 01.21.20 English Dataset contains soccer dialogues over a knowledge graph. 2,890 JSON Knowledge Graphs, Dialogue 2019 SDA Lab, Uni. Of Bonn & Volkswagen Research
Social IQA 01.29.20 English Dataset used as a question-answering benchmark for testing social commonsense intelligence. 37,000+ JSON Question Answering, Commonsense 2019 Sap et al.
Social Media Mining for Health (SMM4H) 01.21.20 English Dataset contains medication-related text classification and concept normalization from Twitter. 25,678 Text Classification 2018 Sarker et al.
Social-IQ Dataset 01.15.20 English Dataset containing videos and natural language questions for visual reasoning. 7,500 n/a Question Answering, Visual, Commonsense 2019 Zadeh et al.
Spambase Dataset 01.15.20 English Dataset contains spam emails. 4,601 Text Classification 1999 Hopkins et al.
Spider 1.0 01.15.20 English Dataset consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains. 10,181 JSON, SQL Semantic Parsing, SQL-to-Text 2018 Yu et al.
Stack Overflow BigQuery Dataset 01.15.20 English BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. n/a n/a Text Corpora 2018 Stack Overflow
Stanford Natural Language Inference (SNLI) Corpus 01.15.20 English Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs. 570,000 Text Inference, Entailment 2015 Bowman et al.
Switchboard Dialogue Act Corpus (SwDA) 01.21.20 English A subset of the Switchboard-1 corpus consisting of 1,155 conversations and 42 tags. 1,155 UTT Dialogue Act Classification 1997 Bates et al.
T-REx 01.15.20 English Dataset contains Wikipedia abstracts aligned with Wikidata entities. 11M aligned triples JSON and NIF Relation Extraction 2018 Elsahar et al.
TabFact 01.15.20 English Dataset contains 16k Wikipedia tables as evidence for 118k human annotated statements to study fact verification with semi-structured evidence. 16,000 JSON Natural Language Inference 2020 Chen et al.
Taskmaster-2 03.29.20 English Dataset consists of 17,289 dialogs in seven domains: restaurants (3276), food ordering (1050), movies (3047), hotels (2355), flights (2481), music (1602), and sports (3478). It consists entirely of spoken two-person dialogs. 17,289 JSON Dialogue 2020 Byrne et al.
Taskmaster-1 01.15.20 English Dataset contains 13,215 task-based dialogs, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations. 13,215 JSON Dialogue 2019 Byrne et al.
Ten Thousand German News Articles Dataset (10kGNAD) 01.15.20 German Dataset consists of 10,273 German-language news articles from an Austrian online newspaper, categorized into nine topics. 10,273 CSV Text Corpora 2019 Timo Block
Tencent AI Lab Embedding Corpus 01.15.20 Chinese Dataset provides 200-dimension vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases. 8M Text Embeddings 2018 Song et al.
TextVQA 01.15.20 English TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions. 36,602 JSON, PNG Question Answering, Visual, Commonsense 2019 Singh et al.
Textbook Question Answering 01.15.20 English The M3C task builds on the popular Visual Question Answering (VQA) and Machine Comprehension (MC) paradigms by framing question answering as a machine comprehension task, where the context needed to answer questions is provided and composed of both text and images. 26,620 JSON, PNG Question Answering, Reading Comprehension, Visual 2017 Kembhavi et al.
The Arabic Parallel Gender Corpus 03.29.20 Arabic Dataset is designed to support research on gender bias in natural language processing applications working on Arabic. Access requires submitting an application for approval. ~12,000 n/a Gender Identification 2019 Habash et al.
The Benchmark of Linguistic Minimal Pairs (BLiMP) 01.15.20 English BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. 67 sub-datasets each with 1,000 minimal pairs JSON Language Modeling 2019 Warstadt et al.
The Conversational Intelligence Challenge 2 (ConvAI2) 01.15.20 English A chit-chat dataset based on PersonaChat dataset. 3,127 JSON Dialogue 2018 NeurIPS
The Corpus of Linguistic Acceptability 01.15.20 English Dataset used to classify sentences as grammatical or not grammatical. 10,657 TSV Grammatical Acceptability 2018 Warstadt et al.
The Cross-lingual Natural Language Inference corpus (XNLI) 03.29.20 Multi-Lingual Dataset contains a collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. 112,500 JSON, Text Entailment 2018 Conneau et al.
The Dialog-based Language Learning Dataset 01.15.20 English Dataset was designed to measure how well models can perform at learning as a student given a teacher’s textual responses to the student’s answer. n/a Text Question Answering, Reading Comprehension 2016 Weston
The Emotion in Text 01.21.20 English Dataset of tweets labelled with emotion. Categories: empty, sadness, enthusiasm, neutral, worry, love, fun, hate, happiness, relief, boredom, surprise, anger. 40,000 CSV Emotion Classification 2016 CrowdFlower
The Irish Times IRS 01.15.20 English Dataset contains 23 years of events from Ireland. 1,425,460 CSV Clustering, Events, Language Detection 2018 Kulkarni
The Movie Dialog Dataset 01.15.20 English Dataset measures how well models can perform at goal and non-goal orientated dialogue centered around the topic of movies (question answering, recommendation and discussion). ~3.5M Text Question Answering, Reading Comprehension 2016 Dodge et al.
The NewsReader MEANTIME Corpus 02.16.20 Multi-Lingual 480 news articles: 120 English Wikinews articles on four topics (i.e. Airbus and Boeing, Apple Inc., Stock market, and General Motors, Chrysler and Ford) and their translations in Spanish, Italian, and Dutch. Annotated with entities, events, temporal, semantic roles and event/entity coreference. 480 XML, NAF Named Entity Recognition (NER) 2016 Minard et al.
The Penn Treebank Project 01.15.20 English Naturally occurring text annotated for linguistic structure. ~1M words Text POS 1995 Marcus et al.
The SimpleQuestions Dataset 01.15.20 English Dataset for question answering with human generated questions paired with a corresponding fact, formatted as (subject, relationship, object), that provides the answer but also a complete explanation. 108,442 Text Question Answering, Reading Comprehension 2015 Bordes et al.
The Stanford Sentiment Treebank (SST) 01.15.20 English Sentence sentiment classification of movie reviews. 69,000 PTB Sentiment Analysis 2013 Socher et al.
The Story Cloze Test | ROCStories 01.15.20 English Dataset for story understanding that provides systems with four-sentence stories and two possible endings. The systems must then choose the correct ending to the story. 100,000+ JSON Question Answering, Reading Comprehension 2017 Mostafazadeh et al.
The TAC Relation Extraction Dataset (TACRED) 03.29.20 English A relation extraction dataset containing 106k+ examples covering 42 TAC KBP relation types. Costs $25 for non-members. 106,264 CoNLL, JSON Relation Extraction 2017 Zhang et al.
The WikiMovies Dataset 01.15.20 English Dataset contains only the QA part of the Movie Dialog dataset, but using three different settings of knowledge: using a traditional knowledge base (KB), using Wikipedia as the source of knowledge, or using IE (information extraction) over Wikipedia. ~100,000 Text Question Answering, Reading Comprehension 2016 Miller et al.
The Winograd Schema Challenge 01.15.20 English Dataset to determine the correct referent of the pronoun from among the provided choices. 150 XML Coreference Resolution 2012 Levesque et al.
Topical-Chat 01.15.20 English A knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles. 10,784 JSON Dialogue 2019 Gopalakrishnan et al.
Total-Text-Dataset 01.15.20 English Dataset used to classify curved text in pictures. ~1,500 JPG Scene Text Detection 2019 Ch'ng et al.
Train-O-Matic Large 03.29.20 Multi-Lingual Automatically-generated corpora in multiple languages with sense annotations for nouns using WordNet for English and BabelNet for all other languages as inventories of senses. 10M+ XML Word Sense Disambiguation  2018 Pasini et al.
Train-O-Matic Small 03.29.20 Multi-Lingual Automatically-generated corpora in multiple languages with sense annotations for nouns using WordNet for English and BabelNet for all other languages as inventories of senses. 1M+ XML Word Sense Disambiguation  2017 Pasini et al.
Translation-Augmented-LibriSpeech-Corpus (Libri-Trans) 03.05.20 English, French Dataset is an augmentation of LibriSpeech ASR and contains English utterances (from audiobooks) automatically aligned with French text. It offers ~236h of speech aligned to translated text. ~236 Hours Text, WAV Speech Translation 2018  Kocabiyikoglu et al.
Trec CAR Dataset 02.16.20 English Dataset contains topics, outlines, and paragraphs that are extracted from English Wikipedia (2016 XML dump). Wikipedia articles are split into the outline of sections and the contained paragraphs. ~285,000 CBOR Information Retrieval 2019 Dietz et al.
TrecQA 01.15.20 English Dataset is commonly used for evaluating answer selection in question answering. n/a XML Question Answering, Reading Comprehension 2007 Wang et al.
TriviaQA 01.15.20 English Dataset containing over 650K question-answer-evidence triples. It includes 95K QA pairs authored by trivia enthusiasts and independently gathered evidence documents, 6 per question on average. 650,000+ JSON Question Answering, Reading Comprehension 2017 Joshi et al.
TupleInf Open IE Dataset 01.15.20 English Dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in "Answering Complex Questions Using Open Information Extraction" (referred to as Tuple KB, T). 263,000 Text Knowledge Base 2017 Allen Institute
Twenty Newsgroups Dataset 01.15.20 English Messages from 20 different newsgroups. 20,000 Text Classification, Clustering 1999 Mitchell et al.
Twitter Chat Corpus 01.29.20 English Dataset contains Twitter question-answer pairs. 5M Text Dialogue 2017 Marsan Ma
Twitter Dataset for Arabic Sentiment Analysis 01.15.20 Arabic Dataset contains Arabic tweets. 2,000 Text Classification, Sentiment Analysis 2014 Abdulla
Twitter US Airline Sentiment 01.15.20 English Contributors were asked to classify positive, negative, and neutral tweets, followed by categorizing negative reasons. 14,500 CSV Classification, Sentiment Analysis 2016 Figure Eight
Twitter100k 01.15.20 English Pairs of images and tweets. 100,000 Text and Images Multi-Modal Learning 2017 Hu et al.
TyDi QA 02.06.20 Multi-Lingual TyDi QA includes question-answer pairs from 11 languages: Arabic, Bengali, English, Finnish, Indonesian, Kiswahili, Russian, Japanese, Korean, Thai, and Telugu. 204,000 JSON Question Answering, Reading Comprehension 2020 Clark et al.
Ubuntu Dialogue Corpus 01.15.20 English Dialogues extracted from Ubuntu chat stream on IRC. 930,000  CSV Text Corpora, Dialogue 2015 Lowe et al.
United Nations Parallel Corpus 02.06.20 Multi-Lingual Parallel corpus consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish. 799,276 TEI, XML Machine Translation 2016 Ziemski et al.
Urban Dictionary Dataset 01.15.20 English Corpus of words, votes and definitions. 2,606,522 CSV Reading Comprehension 2016-05 Anonymous
UseNet Corpus 01.15.20 English UseNet forum postings. 7B Text Dialogue 2011 Shaoul et al.
Visual Commonsense Reasoning (VCR) 01.15.20 English Dataset contains 290K multiple-choice questions on 110K images. 290,000 JSON, JPG Question Answering, Visual, Commonsense 2018 Zellers et al.
VisDial 01.29.20 English Dataset contains images from the COCO training set, along with dialogues. Intended for training models to answer questions about images during a conversation. Contains 1.2M dialog question-answer pairs. 1.2M JSON Question Answering, Visual, Dialogue 2017 Das et al.
Visual QA (VQA) 01.15.20 English Dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense to answer. 265,016 images JSON Visual Question Answering 2015 Antol et al.
Voices Obscured in Complex Environmental Settings (VOiCES) 01.15.20 English Dataset contains a total of 15 hours (3,903 audio files) in male and female read speech. n/a Wav Speech Recognition 2018 Various
VoxCeleb 01.15.20 Multi-Lingual An audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. n/a MD5, URL Speech Recognition, Visual 2017 Nagrani et al.
WAT 2019 Hindi-English 03.29.20 Hindi, English Dataset for multimodal English-to-Hindi translation. The input is an image, a rectangular region in the image, and an English caption; the output is a caption in Hindi. 32,925 Text, JPEG Machine Translation, Multi-Modal Learning 2019 Parida et al.
WMT 14 English-German 01.15.20 Multi-Lingual Sentence pairs for translation. 4.5M Text Machine Translation 2015 Stanford
WMT 15 English-Czech 01.15.20 Multi-Lingual Sentence pairs for translation. 15.8M Text Machine Translation 2016 Stanford
WMT 19 Multiple Datasets 01.15.20 Multi-Lingual Multiple text corpora in multiple languages. n/a Text Text Corpora, Machine Translation 2019 ACL Workshop
WSD English All-Words Fine-Grained Datasets 03.29.20 English Unified five standard all-words Word Sense Disambiguation datasets. 7,000+ XML Word Sense Disambiguation  2017 Raganato et al.
Watan-2004 Corpus 03.29.20 Arabic Dataset contains about 20,000 articles talking about 6 topics: culture, religion, economy, local news, international news and sports. 20,000 HTML Text Corpora 2004 Abbas et al.
Web of Science Dataset 01.15.20 English Hierarchical Datasets for Text Classification. 46,985 Text Classification 2017 Kowsari et al.
WebQuestions Semantic Parses Dataset 01.15.20 English Dataset contains full semantic parses in SPARQL queries for 4,737 questions, and “partial” annotations for the remaining 1,073 questions for which a valid parse could not be formulated or where the question itself is bad or needs a descriptive answer. 5,810 JSON Semantic Parsing 2016 Yih et al.
Webis-CLS-10 03.29.20 Multi-Lingual The Cross-Lingual Sentiment (CLS) dataset comprises about 800,000 Amazon product reviews in the 4 languages: English, German, French, and Japanese. 800,000 Tar Classification, Sentiment Analysis 2010 Prettenhofer et al.
Webis-Snippet-20 Corpus 03.29.20 English Dataset comprises four abstractive snippet datasets from ClueWeb09, ClueWeb12, and DMOZ descriptions. More than 10 million <webpage, abstractive snippet> pairs / 3.5 million <query, webpage, abstractive snippet> pairs were collected. 3.5M JSON Summarization 2020 Chen et al.
Webis-TLDR-17 Corpus 03.29.20 English Dataset contains 3 million pairs of content and self-written summaries mined from Reddit. It is one of the first large-scale summarization datasets from the social media domain. 3,084,410 JSON Summarization 2017 Volske et al.
Web Inventory of Transcribed and Translated Talks (WIT3) 01.29.20 Multi-Lingual Dataset contains a collection of transcribed and translated talks. The core of the dataset is from the TED Talks corpus. As of 2016, it covers 109 languages. n/a XML Machine Translation 2012 Cettolo et al.
Who Did What Dataset 01.15.20 English Dataset contains over 200,000 fill-in-the-gap (cloze) multiple choice reading comprehension problems constructed from the LDC English Gigaword newswire corpus. 200,000+ XML Question Answering, Reading Comprehension 2016 Onishi et al.
WikiAnn 03.29.20 Multi-Lingual Dataset with NER annotations for PER, ORG and LOC. It has been constructed using the linked entities in Wikipedia pages for 282 different languages. 95,924 JSON Named Entity Recognition (NER) 2017 Pan et al.
WikiHow 01.15.20 English Dataset contains article and summary pairs extracted and constructed from an online knowledge base written by different human authors. 230,000+ Text Text Corpora, Summarization 2018 Koupaee et al.
WikiLinks 01.15.20 English Dataset contains 40 million mentions over 3 million entities based on hyperlinks from Wikipedia. ~10M Text Text Corpora 2012 Singh et al.
WikiMatrix 02.16.20 Multi-Lingual Dataset contains 135 million parallel sentences for 1,620 different language pairs in 85 different languages. 135M TSV Machine Translation 2019 Schwenk et al.
WikiQA Corpus 01.15.20 English Dataset contains Bing query logs as the question source. Each question is linked to a Wikipedia page that potentially has the answer.  3,047 TSV Question Answering, Reading Comprehension 2015 Yang et al.
WikiReading 01.29.20 Multi-Lingual The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles. Includes English, Russian and Turkish. 18M JSON Knowledge Base, NLU 2016 Hewlett & Kenter et al.
WikiSQL 02.16.20 English A large collection of automatically generated questions about individual tables from Wikipedia. 80,654 JSON Semantic Parsing, Text-to-SQL 2017 Zhong et al.
WikiSplit 02.16.20 English Dataset contains 1 million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits. 1M TSV Sentence Simplification 2018 Botha et al.
WikiText-103 & 2 02.06.20 English Dataset contains word and character level tokens extracted from Wikipedia. 100M+ TOKENS Language Modeling 2016 Merity et al.
Wikidata NE dataset 02.06.20 English, German Dataset has 2 parts: the Named Entity files and the link files. The Named Entity files include the most important information about the entities, whereas the link files contain the links and ids in other databases. n/a JSON Named Entity Recognition, Knowledge Base 2017 Geiß et al.
Wikipedia 02.16.20 English The 2016-12-21 dump of English Wikipedia. 5,075,182 SQL Text Corpora 2016 Facebook Research
Wikipedia News Corpus 03.29.20 English Text from Wikipedia's current events page with dates. ~25,000 Text Text Corpora 2019 Parth Parikh
WinoGrande 01.29.20 English Formulated as a fill-in-a-blank task with binary options, the goal is to choose the right option for a given sentence which requires commonsense reasoning. 44,000 JSON Commonsense Reasoning 2019 Sakaguchi et al.
Winogender Schemas 01.15.20 English Dataset with pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in automated coreference resolution systems. 720 TSV Coreference Resolution 2018 Rudinger et al.
Wisesight Sentiment Corpus 02.06.20 Thai Dataset contains around 26,700 messages in Thai language from various social media with human-annotated sentiment classification (positive, neutral, negative, and question). ~26,700 Text Classification, Sentiment Analysis 2019 Wisesight
Words in Context 01.15.20 English Dataset for evaluating contextualized word representations. 2,400 Text Word Sense Disambiguation 2019 Pilehvar et al.
Worldwide News - Aggregate of 20K Feeds 01.15.20 Multi-Lingual One week snapshot of all online headlines in 20+ languages. 1,398,431 CSV Clustering, Events, Machine Translation 2017 Kulkarni
X-Stance 03.29.20 Multi-Lingual Dataset contains more than 150 political questions, and 67k comments written by candidates on those questions. The questions are available in German, French, Italian and English. 67,000 JSON Stance Detection 2020 Vamvas et al.
X-Sum 03.05.20 English The XSum dataset consists of 226,711 Wayback-archived BBC articles (2010 to 2017) covering a wide variety of domains: News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts. 226,711 JSON Summarization 2018 Narayan et al.
XQuAD 03.05.20 Multi-Lingual Dataset consists of a subset of 240 context paragraphs and 1,190 question-answer pairs from the development set of SQuAD v1.1 with their translations in 10 languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. 1,190  JSON Question Answering, Reading Comprehension 2019 Artetxe et al.
Yahoo! Music User Ratings of Musical Artists 01.15.20 English Over 10M ratings of artists by Yahoo users. May be used to validate recommender systems or collaborative filtering algorithms. ~10M Text Clustering, PCA 2004 Yahoo!
Yelp Open Dataset 01.15.20 English Dataset containing millions of reviews on Yelp. In addition it contains business data including location data, attributes, and categories. 6,685,900 JSON Classification, Sentiment Analysis 2015 Yelp
YouTube Comedy Slam Preference Dataset 01.15.20 English User vote data for pairs of videos shown on YouTube. Users voted on funnier videos. 1,138,562 Text Classification 2012 Google
arXiv Bulk Data 01.15.20 English A collection of research papers on arXiv. n/a Tar Text Corpora 2011 n/a
bAbI 20 Tasks 01.15.20 English, Hindi Dataset contains a set of contexts, with multiple question-answer pairs available based on the contexts. 2,000 Text Question Answering, Reading Comprehension 2015 Weston et al.
bAbI 6 Tasks Dialogue 01.15.20 English Dataset contains 6 tasks for testing end-to-end dialog systems in the restaurant domain. 3,000 Text Dialogue 2017 Bordes et al.