1 Billion Word Language Model Benchmark (lm1b) 06.25.20 English Dataset used for measuring progress in statistical language modeling. 1.1B n/a Language Modeling 2013 Chelba et al.
1.5 billion Words Arabic Corpus 03.29.20 Arabic The data were collected from newspaper articles in ten major news sources from eight Arabic countries, over a period of fourteen years. 5M XML Text Corpora 2016 El-khair et al.
A Conversational Question Answering Challenge (CoQA) 01.15.20 English Dataset for measuring the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. 127,000+ JSON Question Answering, Reading Comprehension 2019 Reddy et al.
A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning (CLEVR & CoGenT) 01.29.20 English Visual question answering dataset contains 100,000 images and 999,968 questions. 999,968 questions; 100,000 images JSON Question Answering, Visual 2016 Johnson et al.
A Novel Approach to a Semantically-Aware Representation of Items (NASARI) 02.16.20 Multi-Lingual Dataset contains semantic vector representations for BabelNet synsets and Wikipedia pages in several languages: English, Spanish, French, German and Italian. Three vector types are currently available: lexical, unified and embedded. 610K-4.4M depending on language Text Semantic Textual Similarity 2016 Camacho-Collados et al.
A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs (DROP) 01.15.20 English Dataset is used to resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). 96,000 JSON Question Answering, Reading Comprehension 2019 Dua et al.
ABC Australia News Corpus 01.15.20 English Entire news corpus of ABC Australia from 2003 to 2019. 1,186,018 CSV Text Corpora 2019 Rohit Kulkarni
AG News 02.06.20 English Dataset contains more than 1 million news articles for topic classification. The 4 classes are: World, Sports, Business, and Sci/Tech. 1M+ CSV Classification 2015 Zhang et al.
AI2 Reasoning Challenge (ARC) 01.15.20 English Dataset contains 7,787 genuine grade-school level, multiple-choice science questions. 7,787 JSON, CSV Question Answering, Reading Comprehension 2018 Clark et al.
AI2 Science Questions Mercury 01.15.20 English Dataset consists of questions used in student assessments across elementary and middle school grade levels. Includes questions with diagrams and without. 6,940 JSON, JPG Reading Comprehension 2017 Allen Institute
AI2 Science Questions v2.1 01.15.20 English Dataset consists of questions used in student assessments in the United States across elementary and middle school grade levels. Each question is 4-way multiple choice format and may or may not include a diagram element. 5,060 JSON, CSV Question Answering, Reading Comprehension 2017 Allen Institute
AQuA 01.15.20 English Dataset containing algebraic word problems with rationales for their answers. 100,000 JSON Question Answering, Reading Comprehension 2017 Ling et al.
ASTD: Arabic Sentiment Tweets Dataset 03.29.20 Arabic Dataset contains over 10k Arabic sentiment tweets classified into 4 classes: subjective positive, subjective negative, subjective mixed, and objective. 10,000+ Text Classification, Sentiment Analysis 2015 Nabil et al.
ASU Twitter Dataset 01.15.20 English Twitter network data, not actual tweets. Shows connections between a large number of users. 11,316,811 users, 85,331,846 connections CSV Clustering, Graph Analysis 2009 Zafarani et al.
ATIS 02.16.20 English Dataset is a collection of utterances to a flight booking system, accompanied by a relational database and SQL queries to answer the questions. 877 JSON Semantic Parsing, Text-to-SQL 2017 Dahl/Iyer et al.
Abductive Natural Language Inference (aNLI) 01.29.20 English Dataset is a binary-classification task in which the goal is to pick the most plausible explanatory hypothesis given two observations from narrative contexts. It contains 20k commonsense narrative contexts and 200k explanations. 20,000 JSON Classification, Commonsense 2019 Bhagavatula et al.
Abstract Meaning Representation (AMR) Bank 05.26.20 English Dataset contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. 59,255 Text Information Extraction, Semantic Role Labeling 2020 Knight et al.
Academic 02.16.20 English Questions about the Microsoft Academic Search (MAS) database, derived by enumerating every logical query that could be expressed using the search page of the MAS website and writing sentences to match them. 196 JSON Semantic Parsing, Text-to-SQL 2014 Li et al.
Action Learning From Realistic Environments and Directives (ALFRED) 05.04.20 English Dataset contains 8k+ expert demonstrations with 3 or more language annotations each, comprising 25,000 language directives. A trajectory consists of a sequence of expert actions, the corresponding image observations, and language annotations describing segments of the trajectory. 8,055 JSON Multi-Modal Learning 2020 Shridhar et al.
Activitynet-QA 01.15.20 English Dataset contains 58,000 human-annotated QA pairs on 5,800 videos derived from the popular ActivityNet dataset. The dataset provides a benchmark for testing the performance of VideoQA models on long-term spatio-temporal reasoning. 58,000 JSON Question Answering, Visual, Commonsense 2019 Yu et al.
Adversarial NLI (ANLI) 05.26.20 English Dataset is an NLI benchmark created via human-and-model-in-the-loop enabled training (HAMLET). Human annotators were tasked to provide a hypothesis that fools the model into misclassifying the label. 169,265 JSON Natural Language Inference (NLI) 2020 Nie et al.
Adverse Drug Effect (ADE) Corpus 06.25.20 English There are 3 different datasets: DRUG-AE.rel provides relations between drugs and adverse effects, DRUG-DOSE.rel provides relations between drugs and dosages, and ADE-NEG.txt provides all sentences in the ADE corpus that DO NOT contain any drug-related adverse effects. 2,972 Text Information Extraction 2012 Gurulingappa et al.
Advising 02.16.20 English Dataset contains questions regarding course information at the University of Michigan, but with fictional student records. 4,570 JSON Semantic Parsing, Text-to-SQL 2018 Finegan-Dollak et al.
Affective Text 01.21.20 English Classification of emotions in 250 news headlines. Categories: anger, disgust, fear, joy, happiness, sadness, surprise. 250 SGML, Text Emotion Classification 2007 Strapparava et al.
AirDialogue 06.25.20 English Dataset contains 402,038 goal-oriented conversations. 402,038 JSON Dialogue 2018 Wei et al.
All the News 2.0 05.04.20 English Dataset contains 2.7 million articles from 26 different publications from January 2016 to April 1, 2020. 2.7M CSV Text Corpora 2020 Andrew Thompson
Amazon Fine Food Reviews 01.15.20 English Dataset consists of reviews of fine foods from Amazon. 568,454 CSV Classification, Sentiment Analysis 2013 McAuley et al.
Amazon Reviews 01.15.20 English US product reviews from Amazon. 233.1M JSON Classification, Sentiment Analysis 2018 McAuley et al.
AmbigNQ 05.26.20 English Dataset covering 14,042 questions from NQ-open, an existing open-domain QA benchmark. 14,042 JSON Question Answering, Reading Comprehension 2020 Min et al.
An Open Information Extraction Corpus (OPIEC) 01.15.20 English OPIEC is an Open Information Extraction (OIE) corpus, constructed from the entire English Wikipedia containing more than 341M triples. 341M AVRO Knowledge Base, Information Extraction 2019 Gashteovski et al.
Annotated Enron Subject Line Corpus (AESLC) 06.25.20 English Dataset contains email messages of employees in the Enron Corporation. 18,302 Text Summarization 2019 Zhang et al.
Arabic Jordanian General Tweets (AJGT) 03.29.20 Arabic Dataset consists of 1,800 tweets annotated as positive and negative. Modern Standard Arabic (MSA) or Jordanian dialect. 1,800 Excel Classification, Sentiment Analysis 2017 Alomari
Arabic Reading Comprehension Dataset (ARCD) 03.05.20 Arabic Dataset contains 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of the Stanford Question Answering Dataset (Arabic-SQuAD) containing 48,344 questions. ~50,000 JSON Question Answering, Reading Comprehension 2019 Mozannar et al.
Arabic Speech Corpus 03.29.20 Arabic Dataset was recorded in south Levantine Arabic (Damascian accent) using a professional studio. Synthesized speech as an output using this corpus has produced a high quality, natural voice. n/a WAV, LAB Speech Corpora 2016 Halabi
Arabic Violence Twitter Corpus 02.16.20 Arabic Annotated Arabic tweets which mention a violent act. Tweets were classified into 8 classes: Crime, Accident, Crisis, Conflict, Human Rights Abuse, Violence, Opinion, or other. Requires using Twitter API to match IDs with tweets for retrieval. 20,000 Text Classification 2016 Ayman et al.
Arabic in Business and Management Corpora (ABMC) 05.26.20 Arabic Dataset contains 400 Arab companies chairman and chief executive manager statements, 400 Arabic economic news articles, 400 Arabic stock market news articles. 1,200 Text Text Corpora 2016 El-Haj et al.
ArabicWeb16 03.29.20 Arabic Dataset contains 150,211,934 Arabic Web pages with high coverage of dialectal Arabic as well as Modern Standard Arabic (MSA). 150M WARC Text Corpora 2016 Suwaileh et al.
Argentinian Spanish [es-ar] Speech Multi-Speaker Dataset 01.21.20 Spanish (Argentinian) Speech dataset containing about 5,900 transcribed high-quality audio recordings of Argentinian Spanish [es-ar] sentences recorded by volunteers. ~5,900 Wav Speech Recognition 2018 Google
ArguAna TripAdvisor Corpus 03.05.20 English Dataset contains 2,100 hotel reviews balanced with respect to the reviews’ sentiment scores. Reviews are segmented into subsentence-level statements that have been manually classified as a fact, a positive, or a negative opinion. 2,100 XMI Classification, Sentiment Analysis 2014 Wachsmuth et al.
Aristo Tuple KB 01.15.20 English Dataset contains a collection of high-precision, domain-targeted (subject,relation,object) tuples extracted from text using a high-precision extraction pipeline, and guided by domain vocabulary constraints. 282,594 TSV Knowledge Base 2017 Dalvi et al.
ArxivPapers 06.25.20 English Dataset is a corpus of over 100,000 scientific papers related to machine learning. 104,723 CSV Text Corpora 2020 Paperswithcode
Atlas of Machine Commonsense (ATOMIC) 05.26.20 English Dataset is a knowledge graph of 877K textual description triples of inferential knowledge. 877,000 CSV Commonsense, Knowledge Graph 2018 Sap et al.
Audio Visual Scene-Aware Dialog (AVSD) 05.26.20 English Dataset consists of text-based human conversations about short videos from the Charades dataset. 11,816 JSON Multi-Modal Learning, Video Question Answering, Dialogue 2019 Alamri et al.
AudioSet 01.15.20 Multi-Lingual Dataset consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. n/a CSV, TFR Speech Recognition, Visual 2017 Google
Automated Essay Scoring 01.15.20 English Dataset contains student-written essays with scores. n/a TSV, xlsx Scoring Classification 2017 The Hewlett Foundation
Automatic Keyphrase Extraction 01.15.20 English Multiple datasets for automatic keyphrase extraction. n/a Multiple Information Retrieval 1999-2008 Several
BARD Bangla Article Classifier 05.04.20 Bengali A large corpus of Bangla documents classified into 5 classes: sports, state, economy, entertainment, and international. 376,226 Text Classification 2018 Alam et al.
BSNLP-2019 03.29.20 Multi-Lingual Dataset used to classify named entities in web documents in Slavic languages, their lemmatization, and cross-language matching. Dataset covers 4 languages: Bulgarian, Czech, Polish, and Russian. n/a Text, OUT Named Entity Recognition (NER), Entity Linking 2019 Piskorski et al.
Background Knowledge Dialogue Dataset 03.05.20 English Dataset containing movie chats wherein each response is explicitly generated by copying and/or modifying sentences from unstructured background knowledge such as plots, comments and reviews about the movie. 90,000 JSON Dialogue 2018 Moghe et al.
BanFakeNews 05.04.20 Bengali A dataset for detecting fake news in Bangla. News articles were scraped from news portals in Bangladesh. n/a CSV Classification, Fake News Detection 2020 Hossain et al.
Bianet 03.05.20 Multi-Lingual Dataset is a parallel news corpus with 3,214 Turkish articles with their sentence-aligned Kurdish or English translations from the Bianet online newspaper. Requires a request submission for dataset. 3,214 XML Machine Translation 2018 Ataman et al.
Bible Corpus 03.05.20 Multi-Lingual A parallel corpus created from translations of the Bible containing 102 languages. 2.84M XML Machine Translation 2014 Christodoulopoulos et al.
BigPatent 05.26.20 English Dataset consists of 1.3 million records of U.S. patent documents along with human written abstractive summaries. 1.3M n/a Summarization 2019 Sharma et al.
BillSum 06.25.20 English Dataset contains a summarization of US Congressional and California state bills. 22,218 JSON Summarization 2019 Kornilova et al.
BlogFeedback Dataset 01.15.20 English Dataset to predict the number of comments a post will receive based on features of that post. 60,021 Text Regression 2014 Buza
Blogger Authorship Corpus 01.15.20 English Blog post entries of 19,320 people from blogger.com. 681,288 Text Classification, Sentiment Analysis 2006 Schler et al.
Book Depository Dataset 05.04.20 English Dataset contains books from bookdepository.com, not the actual content of the book but a list of metadata like title, description, dimensions, category and others. n/a CSV Topic Modeling, Classification 2020 Simakis
Books Corpus 02.06.20 Multi-Lingual Dataset contains a collection of copyright free books. Corpus consists of 16 languages and 0.91M sentence fragments and 19.50M tokens. 0.91M XCES, XML Machine Translation 2012 Tiedemann
BoolQ 01.15.20 English Question answering dataset for yes/no questions. 15,942 JSON Binary Question Answering 2019 Clark et al.
Break 02.16.20 English Dataset contains 83,978 examples sampled from 10 question answering datasets over text, images and databases. Dataset used to obtain the Question Decomposition Meaning Representation (QDMR) for questions. 83,978 CSV Natural Question Understanding (NQU) 2020 Wolfson et al.
BuGL 05.04.20 English Dataset consists of 54 GitHub projects of four different programming languages namely C, C++, Java and Python with around 10,187 issues. 10,187 JSON, Xlsx Text Corpora 2020 muvvasandeep
Buzz in Social Media Dataset 01.15.20 English Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites. 140,000 Text Classification 2013 Kawala et al.
BuzzFace 05.26.20 English Dataset focused on news stories (which are annotated for veracity) posted to Facebook during September 2016 consisting of: Nearly 1.7 million Facebook comments discussing the news content, Facebook plugin comments, Disqus plugin comments, Associated webpage content of the news articles. 2,263 JSON Classification, Fake News Detection 2018 Santia et al.
CAPES 03.05.20 English, Portuguese A parallel corpus of theses and dissertation abstracts in Portuguese and English from CAPES. 2.32M XML Machine Translation 2012 Tiedemann et al.
CASS 03.05.20 French Dataset is composed of decisions made by the French Court of cassation and summaries of these decisions made by lawyers. 129,445 XML Summarization 2019 Bouscarrat et al.
CCMatrix 02.16.20 Multi-Lingual 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public dataset. 4.5B to be added soon Machine Translation 2019 Schwenk et al.
CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) 01.15.20 English Dataset contains more than 23,500 sentence utterance videos from more than 1000 online YouTube speakers. The dataset is gender balanced. All the sentence utterances are randomly chosen from various topics and monologue videos. 23,500 n/a Sentiment Analysis, Emotion Recognition, Visual 2018 MultiComp Lab
CNN / Daily Mail Dataset 01.15.20 English Cloze-style reading comprehension dataset created from CNN and Daily Mail news articles. 1M+ Question Question Answering, Reading Comprehension 2015 Hermann et al.
COVID-19 Open Research Dataset (CORD-19) 03.29.20 English Dataset contains 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community. 44,000 JSON Text Corpora 2020 Allen Institute
COVID-19 Twitter Chatter Dataset 06.25.20 Multi-Lingual Dataset contains over 152 million tweets, growing daily, related to COVID-19 chatter generated from January 1st, 2020 to present. 152M+ TSV Text Corpora 2020 Banda et al.
COmmonsense Dataset Adversarially-authored by Humans (CODAH) 01.15.20 English Commonsense QA in the sentence completion style of SWAG. As opposed to other automatically generated NLI datasets, CODAH is adversarially constructed by humans who can view feedback from a pre-trained model and use this information to design challenging commonsense questions. 2,776 TSV Question Answering, Reading Comprehension, Commonsense 2019 Chen et al.
CSTR VCTK Corpus 06.25.20 English Dataset contains speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. n/a n/a Text-to-Speech 2017 Veaux et al.
Car Evaluation Dataset 01.15.20 English Car properties and their overall acceptability. 1,728 Text Classification 1997 Bohanec
Children’s Book Test (CBT) 01.15.20 English Dataset contains ‘questions’ from chapters in the book by enumerating 21 consecutive sentences. In each question, the first 20 sentences form the context, and a word is removed from the 21st sentence, which becomes the query. Models must identify the answer word among a selection of 10 candidate answers appearing in the context sentences and the query. ~688,000 Text Question Answering, Reading Comprehension 2016 Hill et al.
Chinese Machine Reading Comprehension (CMRC 2018) 03.29.20 Chinese Dataset is composed of nearly 20,000 real questions annotated on Wikipedia paragraphs by human experts. 20,000 JSON Question Answering, Reading Comprehension 2018 Cui et al.
Chinese Machine Reading Comprehension (CMRC) 05.04.20 Chinese Dataset (cloze style) contains over 100K blanks (questions) within over 10K passages, which originated from Chinese narrative stories. 10,438 JSON Reading Comprehension 2020 Cui et al.
Choice of Plausible Alternatives (COPA) 01.15.20 English Dataset used for open-domain commonsense causal reasoning. 1,000 XML Commonsense Reasoning 2011 Roemmele et al.
Civil Comments 06.25.20 English Dataset contains the archive of the Civil Comments platform. Dataset was annotated for toxicity. n/a CSV Classification 2019 Jigsaw/Conversation AI
ClarQ 06.25.20 English Dataset consists of ~2M question/post tuples distributed across 173 domains of stackexchange. ~2M JSON Clarification Question Generation 2020 Kumar et al.
Clash of Clans 05.26.20 English Dataset contains 50K user comments, both from the iTunes App Store and Google Play. The dataset spans from Oct 18, 2018 to Feb 1, 2019. 50,000 CSV Sentiment Analysis 2019 Issa Annamoradnejad
Classify Emotional Relationships of Fictional Characters 01.21.20 English Dataset contains 19 short stories that are shorter than 1,500 words, and depict at least four different characters. 19 Text Text Corpora, Emotion Classification 2019 Kim et al.
Clinical Case Reports for Machine Reading Comprehension (CliCR) 01.15.20 English Dataset was built from clinical case reports, requiring the reader to answer the query with a medical problem/test/treatment entity. 100,000 JSON Question Answering, Reading Comprehension 2018 Šuster et al.
ClueWeb Corpora 01.15.20 English Annotated web pages from the ClueWeb09 and ClueWeb12 corpora. 340,451,982 Text Classification 2013 Gabrilovich et al.
Coached Conversational Preference Elicitation 01.15.20 English Dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. 12,000 JSON Dialogue 2019 Radlinski et al.
Coarse Discourse 02.16.20 English Dataset contains discourse annotations and relations on threads from Reddit during 2016. Requires merging using Reddit API. 9,473 JSON Text Corpora 2017 Zhang et al.
Code-Mixed-Dialog 03.05.20 Multi-Lingual A goal-oriented dialog dataset containing code-mixed conversations. Specifically, text from the DSTC2 restaurant reservation dataset is used to create code-mixed versions of it in Hindi-English, Bengali-English, Gujarati-English and Tamil-English. 49,167 Text Dialogue 2018 Banerjee et al.
CodeSearchNet Corpus 06.25.20 English Dataset contains functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. 6M JSON Text Corpora 2019 Husain et al.
ColBERT 05.26.20 English Dataset contains 200k short texts (100k positive, 100k negative). Used for humor detection. 200,000 CSV Classification, Humor Detection 2020 Annamoradnejad et al.
CommitmentBank 01.15.20 English Dataset contains naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment canceling operator (question, modal, negation, antecedent of conditional). 1,200 CSV Natural Language Inference (NLI) 2019 Marneffe et al.
Common Objects in Context (COCO) 01.29.20 English COCO is a large-scale object detection, segmentation, and captioning dataset. Dataset contains 330K images (>200K labeled) 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image. 330,000 JSON, JPG Automatic Image Captioning 2014 Lin et al.
Common Sense Explanations (CoS-E) 06.25.20 English Dataset used to train language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation (CAGE) framework. 19,522 JSON Commonsense 2019 Rajani et al.
Common Voice 01.15.20 Multi-Lingual Dataset containing audio in 29 languages and 2,454 recorded hours. n/a MP3 Speech Recognition 2019 Mozilla
CommonCrawl 01.15.20 Multi-Lingual Dataset contains data from 25 billion web pages. 25B WET Text Corpora 2013-2019 Common Crawl Foundation
CommonGen 03.29.20 English Dataset consists of 30k concept-sets with human-written sentences as references. 30,000 JSON Text Generation 2019 Lin et al.
CommonsenseQA 01.15.20 English Multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers. 12,102 JSON Question Answering, Reading Comprehension, Commonsense 2018 Talmor et al.
CompGuessWhat?! 06.25.20 English Dataset contains 65,700 dialogues based on GuessWhat?! dataset dialogues and enhanced by including object attributes coming from resources such as VISA attributes, VisualGenome and ImSitu. 65,700 JSON Grounded Language Learning, Visual 2020 Suglia et al.
Complex Factoid Question Answering with Paraphrase Clusters (ComQA) 02.16.20 English The dataset contains questions with various challenging phenomena such as the need for temporal reasoning, comparison (e.g., comparatives, superlatives, ordinals), compositionality (multiple, possibly nested, subquestions with multiple entities), and unanswerable questions. 11,214 JSON Question Answering, Reading Comprehension 2019 Abujabal et al.
Complex Sequential Question Answering (CSQA) 05.04.20 English Dataset contains around 200K dialogs with a total of 1.6M turns. Further, unlike existing large scale QA datasets which contain simple questions that can be answered from a single tuple, the questions in the dialogs require a larger subgraph of the KG. 200,000 n/a Question Answering, Knowledge Base 2018 Saha et al.
ComplexWebQuestions 01.15.20 English Dataset contains a large set of complex questions in natural language, and can be used in multiple ways. 34,689 JSON Question Answering, Reading Comprehension 2018 Talmor et al.
Compositional Distributional Semantics Corpus (CDSC | E & R) 05.26.20 Polish Dataset is human-annotated for semantic relatedness and entailment by 3 human judges experienced in Polish linguistics. 10,000 TSV Natural Language Inference (NLI) 2017 Wroblewska et al.
Compositional Freebase Questions (CFQ) 05.04.20 English Dataset contains questions and answers and also provides, for each question, a corresponding SPARQL query against the Freebase knowledge base. 239,357 JSON Question Answering, Knowledge Base 2020 Keysers et al.
ConceptNet 05.26.20 Multi-Lingual A knowledge graph that connects words and phrases of natural language (terms) with labeled, weighted edges (assertions). 21M+ edges and 8M+ nodes JSON Commonsense, Knowledge Graph 2017 Speer et al.
Conceptual Captions 01.15.20 English Dataset contains ~3.3M images annotated with captions to be used for the task of automatically producing a natural-language description for an image. 3,318,333 TSV Automatic Image Captioning 2018 Sharma et al.
Conference on Computational Natural Language Learning (CoNLL 2002) 02.16.20 Spanish, Dutch Spanish data is a collection of newswire articles made available by the Spanish EFE News Agency. The Dutch data consist of four editions of the Belgian newspaper "De Morgen" of 2000. IOB2 format. n/a HTML Named Entity Recognition (NER) 2002 Tjong et al.
Conference on Computational Natural Language Learning (CoNLL 2003) 02.06.20 English, German Dataset contains news articles whose text is segmented in 4 columns: the first item is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. English 1,393; German 909 Tar Named Entity Recognition (NER), Part-of-Speech (POS) 2003 Sang et al.
Content-Based Categorized Dataset 03.29.20 Arabic Dataset contains 996 Web pages that were extracted from the ArabicWeb16 dataset and labeled. 996 Text Text Classification 2016 Suwaileh et al.
Conversational Text-to-SQL Systems (CoSQL) 01.15.20 English Dataset consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz collection of 3k dialogues querying 200 complex databases spanning 138 domains. It is the dialogue version of the Spider and SParC tasks. 3,000 JSON, SQL Dialogue, SQL-to-Text 2019 Yu et al.
Cornell Movie--Dialogs Corpus 01.15.20 English This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies. In total 304,713 utterances. 304,713 Text Dialogue 2011 Danescu et al.
Cornell Natural Language for Visual Reasoning (NLVR and NLVR2) 01.29.20 English Dataset contains two language grounding datasets containing natural language sentences grounded in images. The task is to determine whether a sentence is true about a visual input. NLVR2 107,292; NLVR 92,244 JSON Question Answering, Visual 2019 Suhr et al.
Cornell Newsroom 01.15.20 English Dataset contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017. 1.3M JSON Text Corpora, Summarization 2018 Grusky et al.
Corporate Messaging Corpus 01.15.20 English Dataset contains statements classified as information, dialog (replies to users, etc.), or action (messages that ask for votes or ask users to click on links, etc.). 3,118 CSV Classification 2015 Crowdflower
Cosmos QA 01.15.20 English Dataset containing thousands of problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. 35,000 CSV Question Answering, Reading Comprehension, Commonsense 2019 Huang et al.
Credbank 05.26.20 English Dataset comprises more than 60M tweets grouped into 1,049 real-world events, each annotated by 30 human annotators. 60M n/a Credibility 2015 Mitra et al.
Crema-D 06.25.20 English Dataset consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). 7,442 clips of 91 actors with diverse ethnic backgrounds were collected. 7,438 Wav, MP3, Flash Emotion Recognition, Multi-Modal 2014 Cao et al.
Cross-lingual Choice of Plausible Alternatives (XCOPA) 05.26.20 Multi-Lingual Dataset is the translation and reannotation of the English COPA and covers 11 languages: Estonian, Haitian Creole, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese & Mandarin Chinese. The dataset requires both the command of world knowledge and the ability to generalise to new languages. n/a JSON Commonsense Reasoning 2020 Ponti et al.
CrossWOZ 05.04.20 Chinese Dataset is a cross-domain wizard-of-oz task-oriented dataset. It contains dialogue sessions and utterances for 5 domains: hotel, restaurant, attraction, metro, and taxi. 6,000 JSON Dialogue 2020 Zhu et al.
Curation Corpus 03.29.20 English Dataset is a collection of 40,000 professionally-written summaries of news articles, with links to the articles themselves. 40,000 CSV Text Corpora 2020 Curation Corporation
Customer Interaction Data of German Emails and Online Requests 02.06.20 German Dataset is used to evaluate the task of automatically categorizing German customer requests. The dataset consists of a set of emails and online requests sent to the support center of a multimedia software company. 627 XML Text Corpora 2014 Eichler et al.
Cyberbullying Detection (CBD) 05.26.20 Polish Dataset contains annotated tweets that identify harmful or non-harmful content. n/a TSV Classification 2019 Ptaszynski et al.
DEXTER Dataset 01.15.20 English Task given is to determine, from features given, which articles are about corporate acquisitions. 2,600 Text Classification 2008 Reuters
DNA Methylation Corpus 05.26.20 English Dataset contains 200 abstracts including a representative sample of all PubMed citations relevant to DNA methylation, and introduces manual annotation for nearly 3,000 gene/protein mentions and 1,500 DNA methylation and demethylation events. 200 Text Information Extraction, Entity Extraction, Event Extraction 2010 Ohta et al.
DOGC 03.05.20 Catalan, Spanish A collection of documents from the official journal of the Catalan Government in Catalan and Spanish. 21.87M XML Text Corpora, Machine Translation 2012 Tiedemann et al.
DSL Corpus Collection (DSLCC) 01.15.20 Multi-Lingual Dataset contains short excerpts of journalistic texts in similar languages and dialects. 294,000 Text Discriminating between similar languages 2017 Tang et al.
DVQA 01.15.20 English Dataset containing data visualizations and natural language questions. 3,487,194 JSON, PNG Question Answering, Visual, Commonsense 2018 Kafle et al.
DailyDialog 01.21.20 English A manually labelled conversations dataset. Categories: no emotion, anger, disgust, fear, happiness, sadness, surprise. 13,118 Text Emotion Classification 2017 Li et al.
Danish-Similarity-Dataset 03.29.20 Danish Dataset consists of 99 word pairs rated by 38 human judges according to their semantic similarity. 99 CSV Semantic Textual Similarity 2019 Schneidermann
Dataset for Fill-in-the-Blank Humor 01.15.20 English Dataset contains 50 fill-in-the-blank stories similar in style to Mad Libs. The blanks in these stories include the original word and the hint type (e.g. animal, food, noun, adverb). 50 JSON Text Generation 2017 Hossain et al.
Dataset for Intent Classification and Out-of-Scope Prediction 01.21.20 English Dataset is a benchmark for evaluating intent classification systems for dialog systems / chatbots in the presence of out-of-scope queries. 23,000+ JSON Intent Classification 2019 Larson et al.
Dataset for the Machine Comprehension of Text 01.15.20 English Stories and associated questions for testing comprehension of text. 660 Text Question Answering, Reading Comprehension 2013 Richardson et al.
Datasets Knowledge Embedding 05.04.20 English Several datasets containing edges and nodes for knowledge base building. n/a TSV Embeddings 2019 various
Dbpedia 01.15.20 Multi-Lingual The English version of the DBpedia knowledge base currently describes 6.6M entities of which 4.9M have abstracts, 1.9M have geo coordinates and 1.7M depictions. In total, 5.5M resources are classified in a consistent ontology. 6.6M Multiple Knowledge Base 2016 Dbpedia
Deal or No Deal? End-to-End Learning for Negotiation Dialogues 01.15.20 English This dataset consists of 5,808 dialogues, based on 2,236 unique scenarios dealing with negotiations and complex communication. 5,808 Text Dialogue 2017 Lewis et al.
Deft 05.26.20 English Dataset contains annotated content from two different data sources: 1) 2,443 sentences from various 2017 SEC contract filings from the publicly available US Securities and Exchange Commission EDGAR (SEC) database, and 2) 21,303 sentences from open source textbooks including topics in biology, history, physics, psychology, economics, sociology, and government. 23,746 Text Information Extraction, Definition Extraction 2019 Spala et al.
Delta Reading Comprehension Dataset 03.29.20 Chinese Dataset organizes 10,014 paragraphs from 2,108 wiki entries and highlights more than 30,000 questions from the paragraphs. 10,014 JSON Question Answering, Reading Comprehension 2019 Shao et al.
Dengue Dataset 05.26.20 Filipino Dataset for multi-class classification of tweets into 5 classes: absent, dengue, health, mosquito & sick. 5,015 CSV Classification 2018 Livelo et al.
Densely Annotated Wikipedia Texts (DAWT) 02.16.20 Multi-Lingual Dataset contains a total of 13.6M articles across several languages: English, Spanish, Italian, German, French and Arabic. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of entity. 13.6M JSON Named Entity Recognition (NER) 2017 Spasojevic et al.
DiaBLa 05.04.20 French, English Parallel dataset of spontaneous, written, bilingual dialogues for the evaluation of Machine Translation, annotated for human judgments of translation quality. 5,700+ JSON Machine Translation, Dialogue 2019 Bawden et al.
Dialogue Natural Language Inference (NLI) 01.29.20 English Dataset used to improve the consistency of a dialogue model. It consists of sentence pairs labeled as entailment (E), neutral (N), or contradiction (C). 340,000+ JSON Dialogue, Entailment 2019 Welleck et al.
Dialogue-Based Reading Comprehension Examination (DREAM) 05.26.20 English Dataset contains 10,197 multiple choice questions for 6,444 dialogues, collected from English-as-a-foreign-language examinations designed by human experts. DREAM is likely to present significant challenges for existing reading comprehension systems: 84% of answers are non-extractive, 85% of questions require reasoning beyond a single sentence, and 34% of questions also involve commonsense knowledge. 6,444 JSON Question Answering, Reading Comprehension, Dialogue 2019 Sun et al.
Did You Know (DYK) 05.26.20 Polish Dataset contains 4,721 question–answer pairs obtained from the Czy wiesz (Do you know) Wikipedia project. 4,721 TSV Question Answering 2013 Marcinczuk et al.
DiscoFuse 01.21.20 English Dataset contains examples for training sentence fusion models. Sentence fusion is the task of joining several independent sentences into a single coherent text. The data has been collected from Wikipedia and from Sports articles. ~60M TSV Sentence Fusion 2019 Geva et al.
DoQa 05.26.20 English Dataset contains domain specific FAQs via conversational QA that contains 2,437 information-seeking question/answer dialogues (10,917 questions in total) on three different domains: cooking, travel and movies. 10,917 JSON Question Answering, Dialogue 2020 Campos et al.
DocBank 06.25.20 English Dataset contains fine-grained token-level annotations for document layout analysis. It includes 5,053 documents and both the validation set and the test set include 100 documents. 5,053 n/a Document Layout Analysis 2020 Li et al.
DocRed 05.04.20 English Dataset was constructed from Wikipedia and Wikidata. It annotates both named entities and relations. 107,050 JSON Relation Extraction 2019 Yao et al.
DramaQA 05.26.20 English Dataset contains 16,191 question answer pairs from 23,928 various length video clips, with each question answer pair belonging to one of four difficulty levels. 23,928 JSON Question Answering, Visual 2020 Choi et al.
DuReader 01.15.20 Mandarin DuReader version 2.0 contains more than 300K questions, 1.4M evidence documents and 660K human generated answers. 1,431,429 JSON Question Answering, Reading Comprehension 2018 He et al.
DuoRC 05.26.20 English Dataset contains 186,089 unique question-answer pairs created from a collection of 7,680 pairs of movie plots where each pair in the collection reflects two versions of the same movie. 186,089 JSON Paraphrasing Identification 2018 Saha et al.
Dutch Book Reviews 01.21.20 Dutch Dataset contains book reviews along with associated binary sentiment polarity labels. 118,516 Text Classification, Sentiment Analysis 2019 van der Burgh
E2E 05.26.20 English Dataset contains 50k combinations of a dialogue-act-based meaning representation and 8.1 references on average in the restaurant domain. 50,000 xlsx Text Generation 2019 Novikova et al.
ECB Corpus 03.05.20 Multi-Lingual Website and documentation from the European Central Bank. Contains 19 languages. 30.55M XML Text Corpora, Machine Translation 2012 Tiedemann et al.
EMEA 03.05.20 Multi-Lingual A parallel corpus made out of PDF documents from the European Medicines Agency. Contains 22 languages. 26.51M XML Machine Translation 2012 Tiedemann et al.
EmoBank 01.29.20 English Dataset is a large-scale text corpus manually annotated with emotion according to the psychological Valence-Arousal-Dominance scheme. 10,000 CSV Text Corpora 2017 Buechel et al.
Emoter Dataset 05.26.20 English Dataset contains 8,000 quotes (sentences to short paragraphs) collected and manually annotated for emotions from literature, film, and some online articles for sentiment analysis. 8,000 Text Sentiment Analysis 2018 Johnny Dunn
Emotion-Stimulus 01.21.20 English Dataset annotated with both the emotion and the stimulus using FrameNet’s emotions-directed frame. 820 sentences with both cause and emotion and 1,594 sentences marked with their emotion tag. Categories: happiness, sadness, anger, fear, surprise, disgust and shame. 2,414 XML Emotion Classification 2015 Ghazi et al.
EmpatheticDialogues 01.29.20 English Dataset of 25k conversations grounded in emotional situations. 25,000 CSV Dialogue 2019 Rashkin et al.
Enron Email Dataset 01.15.20 English Emails from employees at Enron organized into folders. ~500,000 Text Text Corpora 2004 (2015) Klimt et al.
Essex Arabic Summaries Corpus (EASC) 05.26.20 Arabic Dataset contains 153 Arabic articles and 765 human-generated extractive summaries of those articles. These summaries were generated using Mechanical Turk. 153 Text Summarization 2013 El-Haj
Eubookshop 03.05.20 Multi-Lingual Corpus of documents from the EU bookshop. Contains 48 languages. 173.20M XML Text Corpora, Machine Translation 2012 Tiedemann et al.
Europarl-ST 03.05.20 Multi-Lingual Dataset contains paired audio-text samples for speech translation, constructed using the debates carried out in the European Parliament in the period between 2008 and 2012. Contains 6 European languages: German, English, Spanish, French, Italian and Portuguese. n/a n/a Speech Translation 2020 Iranzo-Sánchez et al.
European Parliament Proceedings (Europarl) 01.15.20 Multi-Lingual The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages. 10M+ XML Text Corpora, Machine Translation 2002 Koehn et al.
Europeana Newspapers 02.16.20 Multi-Lingual Named Entity Recognition corpora for Dutch, French, German languages from Europeana Newspapers. Data is encoded in the IOB format. 486,218 BIO Named Entity Recognition (NER) 2016 Neudecker
Event-focused Emotion Corpora for German and English 01.21.20 English, German German and English emotion corpora for emotion classification, annotated with crowdsourcing in the style of the ISEAR resources. 2,002 TSV Text Corpora, Emotion Classification 2019 Troiano et al.
Event2Mind 01.21.20 English Dataset contains 25,000 events and free-form descriptions of their intents and reactions. 25,000 CSV Commonsense Inference 2018 Rashkin et al.
EventQA 05.04.20 English A dataset for answering Event-Centric questions over Knowledge Graphs (KGs). It contains 1,000 semantic queries and the corresponding verbalisations. 1,000 JSON Question Answering, Knowledge Base 2019 Souza Costa et al.
Examiner Pseudo-News Corpus 01.15.20 English Clickbait, spam, crowd-sourced headlines from 2010 to 2015. 3,089,781 CSV Clustering, Events, Sentiment Analysis 2017 Rohit Kulkarni
Excitement Datasets 02.06.20 English, Italian Datasets contain negative feedback from customers in which they state reasons for dissatisfaction with a given company. The datasets are available in English and Italian. n/a XML Classification, Sentiment Analysis 2015 Kotlerman et al.
Exhaustive PTM Corpus 05.26.20 English Dataset contains 360 abstracts manually annotated in the BioNLP Shared Task event representation for over 4,500 mentions of proteins and 1,000 statements of modification events of nearly 40 different types. 360 Text Information Extraction, Event Extraction 2011 Pyysalo et al.
Explain Like I’m Five (ELI5) 03.05.20 English The dataset contains 270K threads of open-ended questions that require multi-sentence answers. It was extracted from subreddit titled “Explain Like I’m Five” (ELI5), in which an online community answers questions with responses that 5-year-olds can comprehend. Facebook scripts allow you to preprocess data. 270,000 Text Question Answering, Reading Comprehension 2019 Fan et al.
Explanations for Science Questions 01.15.20 English Data contains: gold explanation sentences supporting 363 science questions, relation annotation for a subset of those explanations, and a graphical annotation tool with annotation guidelines. 1,363 CSV Question Answering, Reading Comprehension 2016 Jansen et al.
FB15K-237 Knowledge Base Completion Dataset 05.26.20 English Dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. 237 relations, 14,451 entities Text Relation Prediction 2015 Toutanova et al.
FQuAD 03.05.20 French Dataset contains 25,000+ questions on a set of Wikipedia articles, modeled after SQuAD. 25,000+ JSON Question Answering, Reading Comprehension 2020 d’Hoffschmidt et al.
FT Speech 06.25.20 Danish Dataset contains recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers. 1800 Hours n/a Speech Corpora 2020 Kirkedal et al.
Fact Extraction and Verification (FEVER) 05.26.20 English Dataset contains 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as supported, refuted or notenoughinfo. 185,445 JSON Classification, Fake News Detection 2018 Thorne et al.
Fact-based Visual Question Answering (FVQA) 01.29.20 English Dataset contains image question answering triples. 5,826 questions; 2,190 images JSON Question Answering, Visual 2017 Wang et al.
FakeNewsNet 05.26.20 English Repo contains two datasets with news content, social context, and spatiotemporal information from Politifact and Gossipcop. n/a CSV Classification, Fake News Detection 2018 Shu et al.
Finlex 03.05.20 Finnish, Swedish Dataset is a collection of legislative and other judicial information of Finland, which is available in Finnish and Swedish. 7.98M XML Text Corpora, Machine Translation 2012 Tiedemann et al.
Finnish News Corpus for Named Entity Recognition 03.29.20 Finnish Dataset contains 953 articles (193,742 word tokens) with 6 named entity classes: organization, location, person, product, event, and date. 953 CSV Named Entity Recognition (NER) 2018 Güngör & Sohrab et al.
Fiskmö 03.05.20 Finnish, Swedish Dataset is a parallel corpus of Finnish and Swedish Languages. 4.24M XML Machine Translation 2012 Tiedemann et al.
Flickr30K Entities 05.26.20 English Dataset contains 244k coreference chains and 276k manually annotated bounding boxes for each of the 31,783 images and 158,915 English captions (five per image) in the original dataset. 31,783 Text, XML Automatic Image Captioning 2017 Plummer et al.
Focused Open Biology Information Extraction (FOBIE) 05.26.20 English Dataset contains 1,500 manually-annotated sentences that express domain-independent relations between central concepts in a scientific biology text, such as trade-offs and correlations. 1,500 JSON Relation Extraction 2020 Kruiper et al.
Frames 05.04.20 English Dataset contains 1,369 human-human dialogues with an average of 15 turns per dialogue. This corpus contains goal-oriented dialogues between users who are given some constraints to book a trip and assistants who search a database to find appropriate trips. 1,369 JSON Dialogue 2017 Asri et al.
FreebaseQA 05.26.20 English Dataset contains 28,348 unique questions for open domain QA over the Freebase knowledge graph. 28,348 JSON Question Answering, Knowledge Graph 2019 Jiang et al.
GAP Coreference Dataset 02.16.20 English Dataset contains 8,908 gender-balanced coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia. 8,908 TSV Coreference Resolution 2018 Webster et al.
GQA 01.15.20 English Question answering on image scene graphs. 22M JSON, H5 Question Answering, Visual, Commonsense 2019 Hudson et al.
Genia 05.26.20 English Dataset contains 1,999 Medline abstracts, selected using a PubMed query for the three MeSH terms "human", "blood cells", and "transcription factors". The corpus has been annotated for part-of-speech, constituency syntax, terms, events, relations, and coreference. 1,999 Text, XML Part of Speech (POS), Constituency, Coreference, Event, Relation 2003 Kim et al.
GeoQuery 02.16.20 English Dataset contains utterances issued to a database of US geographical facts. 877 JSON Semantic Parsing, Text-to-SQL 2017 Zelle & Iyer et al.
GermEval 2014 NER Shared Task 02.06.20 German The data was sampled from German Wikipedia and News Corpora as a collection of citations. The dataset covers over 31,000 sentences corresponding to over 590,000 tokens. 31,000+ TSV Named Entity Recognition (NER) 2014 Benikova et al.
Get it #OffMyChest 06.25.20 English Dataset is used for affective understanding of conversations focusing on the problem of how speakers use emotions to react to a situation and to each other. Posts were taken from the 2018 top reddit posts from /r/CasualConversations and /r/OffMyChest. 437,860 CSV Dialogue 2020 Jaidka et al.
Gigaword 06.25.20 English Dataset contains headline-generation on a corpus of article pairs from Gigaword consisting of around 4 million articles. 4M Text Summarization 2015 Rush et al.
Global Voices Parallel Corpus 02.06.20 Multi-Lingual Dataset contains news articles from the web site Global Voices in multiple languages. n/a Text Machine Translation 2015 CASMACAT
GoEmotions 05.26.20 English Dataset contains 58K carefully curated Reddit comments labeled for 27 emotion categories: admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, & surprise. 58,000 CSV Classification, Emotion Recognition 2020 Demszky et al.
Google Books N-grams 01.15.20 Multi-Lingual N-grams from a very large corpus of books. 2.2 TB of text Text Classification, Clustering 2011 Google
Groningen Meaning Bank 02.06.20 English Dataset contains texts in raw and tokenised format, tags for part of speech, named entities and lexical categories, and discourse representation structures compatible with first-order logic. 10,000 XML Text Corpora 2014 University of Groningen
Groove MIDI Dataset (GMD) 06.25.20 n/a Dataset is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive drumming. 13.6 Hours MIDI, Wav Audio Generation 2019 Gillick et al.
Gutenberg Dialogue 05.04.20 Multi-Lingual A dataset created by extracting dialogue from the Gutenberg book collection, comprising ~60,000 books. Currently it supports English, German, Dutch, Spanish, Portuguese, Italian, and Hungarian. 59,971 n/a Dialogue 2020 Csaky et al.
Gutenberg Book Corpus 01.15.20 Multi-Lingual Dataset contains 60,000 eBooks. 60,000 Text Text Corpora 1996-2019 Gutenberg
HAREM 05.26.20 Portuguese Dataset used for Named-Entity Recognition (NER) in Portuguese. n/a XML Named Entity Recognition (NER) 2008 Mota et al.
HJDataset 05.04.20 Japanese Dataset contains over 250,000 layout element annotations of seven types in Japanese documents. 250,000+ JSON Text Corpora 2020 Shen et al.
Hansards Canadian Parliament 01.15.20 English Dataset contains pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament. 1.3M Text Text Corpora 2001 Natural Language Group - USC
Harvard Library 01.15.20 English Dataset contains books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials. 12.7M MODS, Dublin Core Text Corpora n/a Harvard
Hate Speech Dataset 05.26.20 Filipino Dataset contains tweets that are labeled as hate speech or non-hate speech. Collected during the 2016 Philippine Presidential Elections. 18,464 CSV Classification 2019 Cabasag et al.
Hate Speech Identification Dataset 01.15.20 English Dataset contains lexicons, notebooks containing content that is racist, sexist, homophobic, and offensive in general. n/a CSV Classification 2017 Davidson et al.
Hebrew Parallel Movie Subtitles 06.25.20 Hebrew Dataset derived from subtitles of movies and television shows for the purpose of semantic role labeling in Hebrew. It includes both FrameNet and PropBank annotations. 30,789 n/a Semantic Role Labeling 2020 Eyal et al.
HellaSwag 01.29.20 English Dataset for studying grounded commonsense inference. It consists of 70k multiple choice questions about grounded situations: each question comes from one of two domains -- activitynet or wikihow -- with four answer choices about what might happen next in the scene. 70,000 JSON Commonsense Reasoning 2019 Zellers et al.
Historical Newspapers Daily Word Time Series Dataset 01.15.20 English Dataset contains daily contents of newspapers published in the US and UK from 1836 to 1922. 25,000 n/a Text Corpora 2017 Dzogang et al.
Home Depot Product Search Relevance 01.15.20 English Dataset contains a number of products and real customer search terms from Home Depot's website. n/a CSV Classification 2015 Home Depot
Hong Kong Stock Exchange, the Securities and Futures Commission of Hong Kong 06.25.20 Chinese, English Dataset contains aligned sentence pairs from bilingual texts, covering the financial and legal domains in Hong Kong. The sources include government legislations and regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, corporate constitutional documents and others. 350,000+ TSV Text Corpora 2020 Translatefx
HotpotQA 01.15.20 English Dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. 1.25M JSON Question Answering, Reading Comprehension 2018 Yang et al.
How2 03.05.20 English, Portuguese Dataset of instructional videos covering a wide variety of topics across video clips (about 2,000 hours), with word-level time alignments to the ground-truth English subtitles. Of these, 300 hours were also translated into Portuguese subtitles. ~2,000 Hours n/a Speech-to-Text, Translation, Summarization, Visual 2018 Sanabria et al.
Human-in-the-loop Dialogue Simulator (HITL) 01.15.20 English Dataset provides a framework for evaluating a bot’s ability to learn to improve its performance in an online setting using feedback from its dialog partner. The dataset contains questions based on the bAbI and WikiMovies datasets, with the addition of feedback from the dialog partner. n/a Text Question Answering, Reading Comprehension 2016 Li et al.
Humicroedit 06.25.20 English Dataset contains 15,095 edited news headlines and their numerically assessed humor. 15,095 CSV Classification 2019 Hossain et al.
HybridQA 05.04.20 English Dataset contains over 70K question-answer pairs based on 13,000 tables; each table is linked to 44 passages on average. 70,000 JSON Question Answering, Knowledge Base 2020 Chen et al.
IIT Bombay English-Hindi Corpus 01.21.20 English, Hindi Dataset contains parallel corpus for English-Hindi as well as monolingual Hindi corpus collected from a variety of existing sources. 1.49M n/a Machine Translation 2018 Kunchukuttan et al.
IRC Disentanglement 05.26.20 English Dataset contains 77,563 messages of internet relay chat (IRC). Almost all are from the Ubuntu IRC Logs. 77,563 Text Dialogue 2019 Kummerfeld et al.
IWSLT 15 English-Vietnamese 01.15.20 Multi-Lingual Sentence pairs for translation. 133,000 Text Machine Translation 2015 Stanford
IWSLT'15 English-Vietnamese 02.06.20 Multi-Lingual Parallel corpus used for English-Vietnamese machine translation. ~130,000 Text Machine Translation 2015 Hong et al.
Igbo Text 05.26.20 Igbo, English Dataset is a parallel dataset for the Igbo language. 10.3M Text, XML Text Corpora, Machine Translation 2019 Ruoho Ruotsi
Indic Languages Multilingual Parallel Corpus 02.16.20 Indian Dataset contains several languages: Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese, Urdu and English. The corpus has been collected from OPUS and belongs to the spoken language (OpenSubtitles) domain. n/a Tar Machine Translation 2018 NICT & Kyoto Univ.
InfoTabs 05.26.20 English Dataset contains human-written textual hypotheses based on premises that are tables extracted from Wikipedia info-boxes. 2,540 TSV, HTML, JSON Natural Language Inference (NLI) 2020 Gupta et al.
InsuranceQA 01.29.20 English Dataset contains questions and answers collected from the website Insurance Library. It consists of questions from real-world users; the high-quality answers were composed by professionals with deep domain knowledge. There are 16,889 questions in total. 16,889 n/a Question Answering, Reading Comprehension 2015 Feng et al.
Irony Sarcasm Analysis Corpus 01.29.20 English Dataset contains tweets in 4 subgroups: irony, sarcasm, regular and figurative. Requires using Twitter API in order to obtain tweets. 33,000 TSV Classification, Sentiment Analysis 2016 Ling et al.
JW300 06.25.20 Multi-Lingual Dataset is a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average. 105.11M XML Machine Translation 2019 Agic et al.
Jeopardy Questions Answers 01.15.20 English Dataset contains Jeopardy questions, answers and other data. 216,930 JSON Question Answering, Reading Comprehension 2014 Anonymous
KALIMAT Multipurpose Arabic Corpus 05.26.20 Arabic Dataset contains 20,291 Arabic articles collected from the Omani newspaper Alwatan. It also includes extractive single-document and multi-document system summaries, and articles annotated with named entities. The data has 6 categories: culture, economy, local-news, international-news, religion, and sports. 20,291 Text Summarization, Named Entity Recognition (NER), Part-of-Speech (POS) 2013 El-Haj et al.
KdConv 05.04.20 Chinese Dataset is a Chinese multi-domain dataset, grounding the topics in multi-turn conversations to knowledge graphs. KdConv contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0. 4,500 JSON Dialogue, Knowledge Graph 2020 Zhou et al.
Kensho Derived Wikimedia Dataset (KDWD) 02.06.20 English Dataset contains two main components - a link annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base. n/a CSV, JSON Text Corpora, Knowledge Base 2020 Kensho R&D
Khaleej-2004 Corpus 03.29.20 Arabic Dataset contains more than 5,000 articles which correspond to nearly 3 million words across 4 topics: International News, Local News, Economy, and Sports. 5,690 HTML Text Corpora 2004 Abbas et al.
KorNLI 05.04.20 Korean Dataset used for natural language inference for the Korean language. 950,354 TSV Natural Language Inference (NLI) 2020 Ham et al.
KorQuAD 03.05.20 Korean Dataset containing a total of 100,000+ question answer pairs. 102,960 JSON Question Answering, Reading Comprehension 2019 Lim et al.
KorSTS 05.04.20 Korean Dataset used for semantic textual similarity for the Korean language. 8,628 TSV Semantic Textual Similarity 2020 Ham et al.
Korean Hate Speech Dataset 06.25.20 Korean Dataset contains ~9.4K manually labeled entertainment news comments for identifying Korean toxic speech. 9,381 Text Classification 2020 Moon et al.
Korean Single Speaker Dataset (KSS) 05.26.20 Korean Dataset consists of audio files recorded by a professional voice actress and their aligned text extracted from books. 12,853 WAV Text-to-Speech 2019 Kyubyong Park
LC-QuAD 2.0 03.05.20 English Dataset contains questions and SPARQL queries. LC-QuAD uses DBpedia v04.16 as the target KB. 30,000 JSON Question Answering, Knowledge Graph 2017 Dubey et al.
LCSTS 05.26.20 Chinese Dataset constructed from the Chinese microblogging website Sina Weibo. It consists of over 2 million real Chinese short texts with short summaries given by the author of each text. Requires application. 2M+ n/a Summarization 2015 Hu et al.
LIAR Dataset 05.26.20 English Dataset contains 12.8K manually labeled short statements in various contexts from POLITIFACT.COM, which provides detailed analysis report and links to source documents for each case. 12,800 CSV Classification, Fake News Detection 2017 Wang et al.
Language Modeling Broadened to Account for Discourse Aspects (LAMBADA) 02.06.20 English Dataset contains narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. 10,022 Text Natural Language Understanding, Language Modeling 2016 Paperno et al.
Large Movie Review Dataset - Imdb 02.06.20 English Dataset contains 25,000 highly polar movie reviews for training, and 25,000 for testing 50,000 Text Classification, Sentiment Analysis 2011 Maas et al.
Legal Case Reports 01.15.20 English Federal Court of Australia cases from 2006 to 2009. 4,000 Text Classification 2012 Galgani et al.
Leipzig Corpora Collection 05.04.20 Multi-Lingual Dataset containing 252 languages of web crawled news corpora. n/a Text Text Corpora 2012 Goldhahn et al.
Libri-Light 05.26.20 English Dataset contains 60K hours of unlabelled speech from audiobooks in English and a small labelled data set (10h, 1h, and 10 min). 60,000 Hours FLAC, JSON Speech Recognition 2019 Khan et al.
LibriMix 06.25.20 English Dataset is used for speech source separation in noisy environments. It is derived from LibriSpeech signals (clean subset) and WHAM noise. It offers a free alternative to the WHAM dataset and complements it. 400+ Hours n/a Speech Separation 2020 Cosentino et al.
LibriSpeech ASR 01.15.20 English Large-scale (1000 hours) corpus of read English speech. n/a FLAC Speech Recognition 2015 OpenSLR
LibriTTS 06.25.20 English Dataset is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate. 585 Hours MP3 Text-to-Speech 2019 Zen et al.
LibriVoxDeEn 03.05.20 German, English Dataset contains sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences. 50,000+ Text, TSV Speech Translation, Machine Translation 2019 Beilharz et al.
Ling-Spam Dataset 01.15.20 English Corpus contains both legitimate and spam emails. n/a Text Classification 2000 Androutsopoulos et al.
Linked WikiText-2 05.04.20 English Dataset contains over 2 million tokens from Wikipedia articles, along with annotations linking mentions to their corresponding entities and relations in Wikidata. 2M JSON Knowledge Graph 2019 Logan et al.
LitBank 02.06.20 English Dataset contains 100 works of English-language fiction. It currently contains annotations for entities, events and entity coreference in a sample of ~2,000 words from each of those texts, totaling 210,532 tokens. 100 TSV, Text Named Entity Recognition (NER) 2019 Bamman et al.
Ljspeech 06.25.20 English Dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours. ~24 Hours Wav Speech Corpora 2017 Keith Ito
Logic2Text 05.26.20 English Dataset contains 5,600 tables and 10,753 descriptions involving common logic types paired with the underlying logical forms. 10,753 JSON Data-to-Text 2020 Chen et al.
LogicNLG 05.26.20 English Dataset is a table-based factchecking dataset with rich logical inferences in the annotated statements. 37,000 JSON Data-to-Text 2020 Chen et al.
MATINF 05.04.20 Chinese A labeled dataset for classification, question answering and summarization. MATINF contains 1.07 million question-answer pairs with human-labeled categories and user-generated question descriptions. 1.07M n/a Classification, Question Answering, Summarization 2020 Xu et al.
MEDIQA-Answer Summarization 06.25.20 English Dataset containing question-driven summaries of answers to consumer health questions. 156 JSON Summarization 2020 Savery et al.
MLSUM 05.26.20 Multi-Lingual Dataset was collected from online newspapers and contains 1.5M+ article/summary pairs in 5 languages: French, German, Spanish, Russian, & Turkish. 1.5M+ n/a Summarization 2020 Scialom et al.
MMD 06.25.20 English Dataset contains over 150K conversation sessions between shoppers and sales agents. 150,000+ JSON Dialogue 2017 Saha et al.
MPQA Opinion Corpus 05.04.20 English Dataset contains news articles and other text documents manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.). 70 XML Sentiment Analysis 2015 Deng et al.
MSParS 01.15.20 English Dataset for the open domain semantic parsing task. 81,826 Satori Semantic Parsing 2019 Microsoft
MalayalamMixSentiment 06.25.20 Malayalam Dataset contains 6,739 comments and 7,743 distinct sentences. There are 5 classes: Positive, Negative, Mixed feelings, Neutral, and Non-Malayalam. Requires emailing the author to download the dataset. 6,739 n/a Sentiment Analysis 2020 Chakravarthi et al.
ManyModalQA 06.25.20 English Dataset contains 10,190 questions, 2,873 images, 3,789 text, and 3,528 tables scraped from Wikipedia. 10,190 JSON, PNG Question Answering, Multi-Modal 2020 Hannan et al.
Math Dataset 06.25.20 English Dataset contains mathematical question and answer pairs, from a range of question types at roughly school-level difficulty. n/a Text Mathematical Reasoning 2019 Saxton et al.
MathQA 05.04.20 English Dataset contains English multiple-choice math word problems covering multiple math domain categories by modeling operation programs corresponding to word problems in the AQuA dataset. 37,000 JSON Question Answering, Reading Comprehension 2019 Amini et al.
Meta-Learning Wizard-of-Oz (MetaLWOz) 01.15.20 English Dataset designed to help develop models capable of predicting user responses in unseen domains. It was created by crowdsourcing 37,884 goal-oriented dialogs, covering 227 tasks in 47 domains. 37,884 Text Dialogue 2019 Microsoft
Microsoft Information-Seeking Conversation (MISC) dataset 01.15.20 English Dataset contains recordings of information-seeking conversations between human “seekers” and “intermediaries”. It includes audio and video signals; transcripts of conversation; affectual and physiological signals; recordings of search and other computer use; and post-task surveys on emotion, success, and effort. n/a various Speech Recognition, Dialogue, Visual 2018 Microsoft
Microsoft Machine Reading COmprehension Dataset (MS MARCO) 01.15.20 English Dataset focused on machine reading comprehension, question answering, passage ranking, keyphrase extraction, and conversational search studies. 1,010,916 JSON Question Answering, Reading Comprehension 2016 Bajaj et al.
Microsoft Research Paraphrase Corpus (MRPC) 01.15.20 English Dataset contains pairs of sentences which have been extracted from news sources on the web, along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship. 5,800 Text Paraphrasing Identification 2005 Dolan et al.
Microsoft Research Social Media Conversation Corpus 01.15.20 English A-B-A triples extracted from Twitter. 4,232 Text Graph Analysis 2016 Sordoni et al.
Microsoft Speech Corpus 01.15.20 Indian Dataset contains conversational and phrasal speech training and test data for Telugu, Tamil and Gujarati languages. n/a Wav Speech Recognition 2019 Microsoft
Microsoft Speech Language Translation Corpus (MSLT) 01.15.20 Multi-Lingual Dataset contains conversational, bilingual speech test and tuning data for English, Chinese, and Japanese. It includes audio data, transcripts, and translations; and allows end-to-end testing of spoken language translation systems on real-world data. n/a Wav Speech Recognition, Machine Translation 2017 Federmann et al.
MoviE Text Audio QA (MetaQA) 05.04.20 English Dataset contains more than 400K questions for both single and multi-hop reasoning, and provides more realistic text and audio versions. MetaQA serves as a comprehensive extension of WikiMovies. 400,000+ Text, MP3 Question Answering, Knowledge Base 2018 Zhang et al.
MovieLens 01.15.20 English Dataset contains 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users. ~22M Text Clustering, Classification, Regression 2016 Harper et al.
MovieQA 06.25.20 English Dataset used to evaluate automatic story comprehension from both video and text. The data set consists of almost 15,000 multiple choice question answers obtained from over 400 movies. 14,944 JSON Multi-Modal Learning, Video Question Answering 2016 Tapaswi et al.
MovieTweetings 01.15.20 English Movie rating dataset based on public and well-structured tweets. 822,784 Text Classification, Regression 2018 Dooms
MuST-C 03.05.20 Multi-Lingual Dataset is a speech translation corpus containing 385 hours from TED talks for speech translation from English into several languages: Dutch, French, German, Italian, Portuguese, Romanian, Russian, & Spanish. Requires filling out a request form. 385 Hours n/a Speech Translation 2019 Di Gangi et al.
MuTual 05.04.20 English Retrieval-based dataset for multi-turn dialogue reasoning, which is modified from Chinese high school English listening comprehension test data. 8,860 Text Dialogue 2020 Cui et al.
Multi-Domain Wizard-of-Oz Dataset (MultiWoz) 01.15.20 English Dataset of human-human written conversations spanning over multiple domains and topics. The dataset was collected based on the Wizard of Oz experiment on Amazon MTurk. 10,438 JSON Dialogue 2018 Budzianowski et al.
Multi-News 06.25.20 English Dataset consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited. 56,216 SRC Summarization 2019 Fabbri et al.
Multi30k 03.29.20 German, English Dataset of images paired with sentences in English and German. This dataset extends the Flickr30K dataset. 31,014 n/a Machine Translation, Multi-Modal Learning 2016 Elliott et al.
MultiLing Pilot 2011 Dataset 03.29.20 Multi-Lingual Dataset is derived from publicly available WikiNews English texts and translated into 7 languages: Arabic, Czech, English, French, Greek, Hebrew, Hindi. n/a Text Summarization 2011 Giannakopoulos et al.
MultiLingual Question Answering (MLQA) 02.06.20 Multi-Lingual Dataset for evaluating cross-lingual question answering performance. ~12K QA instances in English and 5K in each other language in SQuAD format in seven languages - English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. 46,444 JSON Question Answering, Reading Comprehension 2019 Lewis et al.
MultiNLI Matched/Mismatched 01.15.20 English Dataset contains sentence pairs annotated with textual entailment information. 433,000 JSON, Text Entailment 2017 Williams et al.
Multilingual Corpus of Sentence-Aligned Spoken Utterances (MaSS) 03.29.20 Multi-Lingual Dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). Languages: Basque, English, Finnish, French, Hungarian, Romanian, Russian, Spanish. 8,130 n/a Speech Corpora 2020 Boito et al.
Multimodal Comprehension of Cooking Recipes (RecipeQA) 01.15.20 English Dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. 20,000 JSON Question Answering, Reading Comprehension 2018 Yagcioglu et al.
Multimodal EmotionLines Dataset (MELD) 05.04.20 English Dataset contains the same dialogue instances available in the EmotionLines dataset, but it also encompasses audio and visual modalities along with text. It has more than 1,400 dialogues and 13,000 utterances from the Friends TV series. Each utterance in a dialogue is labeled with one of seven emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise and Fear. Each utterance also has a sentiment annotation (positive, negative or neutral). 1,400 CSV, MP4 Multi-Modal Learning 2018 Poria et al.
Multimodal Sarcasm Detection Dataset (MUStARD) 05.04.20 English The dataset, a multimodal video corpus, consists of audiovisual utterances annotated with sarcasm labels. Each utterance is accompanied by its context, which provides additional information on the scenario where the utterance occurs. 6,365 JSON Multi-Modal Learning 2019 Castro et al.
MutualFriends 01.15.20 English Task where two agents must discover which friend of theirs is mutual based on the friend's attributes. n/a JSON Dialogue 2017 He et al.
NEJM-enzh 06.25.20 Chinese, English Dataset is an English-Chinese parallel corpus, consisting of about 100,000 sentence pairs and 3,000,000 tokens on each side, from the New England Journal of Medicine (NEJM). 100,000 n/a Machine Translation 2020 Liu et al.
NELA-GT-2019 05.04.20 English Dataset contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Included are source-level ground truth labels from 7 different assessment sites. 1.12M JSON Text Corpora, Classification 2020 Gruppi et al.
NIPS Papers 06.25.20 English Dataset contains the title, authors, abstracts, and extracted text for all NIPS papers between 1987-2016. ~3,000 CSV Text Corpora 2017 Ben Hamner
NKJP-NER 05.26.20 Polish Dataset contains extracted sentences with named entities of exactly one type. The task is to predict the type of the named entity. 20,000 TSV Named Entity Recognition (NER) 2020 Przepiorkowski
NLP Chinese Corpus 01.15.20 Chinese Large text corpora in Chinese. 10M+ JSON Text Corpora 2019 Xu et al.
NPS Chat Corpus 01.15.20 English Posts from age-specific online chat rooms. ~500,000 XML Dialogue 2007 Forsyth et al.
NSynth Dataset 06.25.20 n/a Dataset contains ~300K musical notes, each with a unique pitch, timbre, and envelope. n/a JSON, Wav Audio Synthesis 2017 Engel et al.
NUS SMS Corpus 01.15.20 Mandarin, English SMS messages collected between 2 users, with timing analysis. 67,093 XML Dialogue 2013 Kan et al.
NYSK Dataset 01.15.20 English English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. 10,421 XML Sentiment Analysis, Topic Extraction 2013 Dermouche et al.
Named Entity Model for German, Politics (NEMGP) 02.16.20 German Dataset contains texts from Wikipedia and WikiNews, manually annotated with named entity information. 5,094 Text Named Entity Recognition (NER) 2013 Zastrow
NarrativeQA 01.15.20 English Dataset contains the list of documents with Wikipedia summaries, links to full stories, and questions and answers. 1,572 CSV Question Answering, Reading Comprehension 2017 KočiskĂœ et al.
Natural Language Inference in Turkish (NLI-TR) 05.04.20 Turkish Datasets that were obtained by translating the SNLI and MNLI corpora into Turkish. n/a JSON Natural Language Inference (NLI) 2020 Budur et al.
Natural Questions (NQ) 01.15.20 English Dataset contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. 320,000+ HTML Question Answering, Reading Comprehension 2019 Kwiatkowski et al.
Neutralizing Biased Text 03.29.20 English A parallel corpus of 180,000+ sentence pairs where one sentence is biased and the other is neutralized. The data were obtained from debiasing Wikipedia edits. 180,000 n/a Biased Text Neutralization 2019 Pryzant et al.
News Category Dataset 06.25.20 English Dataset contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost. ~200,000 JSON Classification 2018 Rishabh Misra
News Headlines Dataset for Sarcasm Detection 01.15.20 English High-quality dataset with sarcastic and non-sarcastic news headlines. 26,709 JSON Clustering, Events, Language Detection 2018 Misra
News Headlines Of India 01.15.20 English Dataset contains an archive of notable events in India during 2001-2018, recorded by the Times of India. 2,969,922 CSV Text Corpora 2017 Rohit Kulkarni
NewsQA 01.15.20 English Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN. 12,744 JSON, CSV Question Answering, Reading Comprehension 2017 Trischler et al.
Ohsumed Dataset 05.04.20 English Dataset containing references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). n/a OKC Classification 1997 Joachims
One Week of Global News Feeds 01.15.20 Multi-Lingual Dataset contains most of the new news content published online over one week in 2017 and 2018. 3.3M CSV Text Corpora 2018 Rohit Kulkarni et al.
OneCommon 01.29.20 English Dataset contains 6,760 dialogues. 6,760 JSON Dialogue 2019 Udagawa et al.
OneSeC Small 03.29.20 Multi-Lingual Automatically-generated corpora in multiple languages with sense annotations for nouns using WordNet for English and BabelNet for all other languages as inventories of senses. 1M+ XML Word Sense Disambiguation 2019 Scarlini et al.
OneStopQA 05.26.20 English Dataset comprises 30 articles from the Guardian in 3 parallel text difficulty versions and contains 1,458 paragraph-question pairs with multiple choice questions, along with manual span markings for both correct and incorrect answers. 30 Text Question Answering, Reading Comprehension 2020 Berzak et al.
OntoNotes 5.0 01.21.20 Multi-Lingual Dataset contains various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference). n/a Text, SQL Information Retrieval, Syntactic Parsing 2013 Weischedel et al.
Open Images V6 03.05.20 English Dataset containing millions of images that have been annotated with image-level labels and object bounding boxes. 9,178,275 TSV, CSV Automatic Image Captioning 2018 Kuznetsova et al.
Open Research Corpus 01.15.20 English Dataset contains over 39 million published research papers in Computer Science, Neuroscience, and Biomedical. 39M JSON Text Corpora 2018 Ammar et al.
Open Resource for Click Analysis in Search (ORCAS) 06.25.20 English ORCAS is a click-based dataset associated with the TREC Deep Learning Track. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries. 10,405,342 TSV Document Ranking 2020 Craswell et al.
OpenBookQA 01.15.20 English Dataset modeled after open book exams for assessing human understanding of a subject. It consists of 5,957 multiple-choice elementary-level science questions (4,957 train, 500 dev, 500 test), which probe the understanding of a small "book" of 1,326 core science facts and the application of these facts to novel situations. 5,957 JSON Question Answering, Reading Comprehension 2018 Mihaylov et al.
OpenDialKG 05.04.20 English Dataset of conversations between two crowdsourcing agents engaging in a dialog about a given topic. Each dialog turn is paired with its corresponding “KG paths” that weave together the KG entities and relations that are mentioned in the dialog. 15,000 Text Dialogue, Knowledge Graph 2019 Moon et al.
OpenKeyPhrase (OpenKP) 05.26.20 English Open domain keyphrase extraction dataset containing 148,124 real world web documents along with a human annotation indicating the 1-3 most relevant keyphrases. 148,124 JSON Question Answering, Reading Comprehension 2019 Xiong et al.
OpenSubtitles 01.29.20 Multi-Lingual Dataset of multi-lingual dialogs from movie scripts. Includes 62 languages. n/a XML, XCES Dialogue 2016 Tiedemann et al.
OpenWebTextCorpus 01.15.20 English Dataset contains text from millions of webpages linked from Reddit URLs, totalling 38GB of text data. 8,013,769 n/a Text Corpora 2019 Gokaslan et al.
Open Super-Large Crawled Almanach Corpus (OSCAR) 01.29.20 Multi-Lingual Multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. 166 different languages available. n/a Text Text Corpora 2019 Suárez et al.
OpinRank Review Dataset 01.15.20 English Reviews of cars and hotels from Edmunds.com and TripAdvisor. Edmunds: 42,230, TripAdvisor: 259,000 Text Information Retrieval, Entity Ranking, Entity Retrieval 2011 Ganesan et al.
Opinosis 06.25.20 English Dataset contains sentences extracted from reviews for 51 topics. Topics and opinions are obtained from TripAdvisor, Edmunds.com and Amazon.com. n/a Jar Summarization 2010 Ganesan et al.
PARANMT-50M 05.26.20 English Dataset containing more than 50 million English-English sentential paraphrase pairs. 50M Text Paraphrasing Generation 2018 Wieting et al.
PG-19 02.16.20 English Dataset contains a set of books extracted from the Project Gutenberg books library that were published before 1919. It also contains metadata of book titles and publication dates. 28,752 Text Text Corpora, Language Modeling 2019 Rae et al.
PTM Event Corpus 05.26.20 English Dataset contains 157 PubMed abstracts annotated for over 1,000 proteins and 400 post-translational modification events identifying the modified proteins and sites. 157 Text Information Extraction, Event Extraction 2010 Ohta et al.
ParCorFull 03.29.20 German, English A parallel corpus annotated for the task of translation of coreference across languages. 14,927 XML Machine Translation, Coreference Resolution 2018 Lapshinova-Koltunski et al.
ParaBank 06.25.20 English Dataset contains paraphrases with 79.5 million references and on average 4 paraphrases per reference. 79.5M references TSV Semantic Textual Similarity 2019 Hu et al.
ParaCrawl Corpus 05.26.20 Multi-Lingual Multiple parallel datasets of European languages for machine translation. n/a Text Machine Translation 2018 ParaCrawl Project
Parallel Arabic DIalectal Corpus (PADIC) 03.29.20 Arabic Dataset is a multi-dialectal corpus - contains six dialects in addition to MSA in Buckwalter format. 6,000+ HTML Text Corpora 2013 Abbas et al.
Parallel Meaning Bank 02.06.20 Multi-Lingual Dataset contains sentences and texts in raw and tokenised format, syntactic analysis, word senses, thematic roles, reference resolution, and formal meaning representations. The annotated parallel corpus includes English, German, Dutch and Italian languages. 8,705 XML Text Corpora 2017 University of Groningen
Paraphrase Adversaries from Word Scrambling (PAWS) 01.21.20 English Dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. 750,000+ TSV Paraphrasing Identification 2019 Zhang et al.
Paraphrase Adversaries from Word Scrambling (PAWS-X) 01.21.20 Multi-Lingual Dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki. 300,000+ TSV Paraphrasing Identification 2019 Yang et al.
Paraphrase and Semantic Similarity in Twitter (PIT) 01.15.20 English Dataset focuses on whether tweets have (almost) same meaning/information or not. 18,762 Text Classification 2015 Xu et al.
Perlex 05.26.20 Persian Dataset is an expert translated version of the Semeval-2010-Task-8 dataset. 10,717 n/a Relation Extraction 2020 Asgari-Bidhendi et al.
Personae Corpus 01.15.20 Dutch Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays. 145 Text Classification, Regression 2008 Luyckx et al.
Personal Events in Dialogue Corpus 05.26.20 English Dataset is a corpus containing annotated dialogue transcripts from fourteen episodes of the podcast This American Life. It contains 1,038 utterances, made up of 16,962 tokens, of which 3,664 represent events. 1,038 Text Information Extraction, Event Extraction, Dialogue 2020 Eisenberg et al.
Personalized Dialog 01.15.20 English Dataset of dialogs from movie scripts. 12,000 Text Dialogue 2017 Joshi et al.
Physical IQA 01.29.20 English Dataset is a commonsense QA benchmark for naive physics reasoning, focusing on how we interact with everyday objects in everyday situations. The dataset includes 20,000 QA pairs that are either multiple-choice or true/false questions. 20,000 JSON Question Answering, Commonsense 2019 Bisk et al.
Plaintext Jokes 01.15.20 English Database of 208,000 jokes scraped from three sources. 208,000 JSON Text Corpora 2016 Pungas et al.
PoKi 05.04.20 English Dataset is a corpus of 61,330 poems written by children from grades 1 to 12. 61,330 CSV Text Corpora 2020 Hipson et al.
PolEmo2.0-IN & OUT 05.26.20 Polish Dataset contains online reviews from medicine and hotels domains. The task is to predict the sentiment of a review. 8,216 TSV Sentiment Analysis 2019 Kocon et al.
Polish Parliamentary Corpus (PPC) 05.26.20 Polish Dataset is a collection of linguistically analysed documents from the proceedings of Polish Parliament, Sejm and Senate. It is based on the Polish Sejm Corpus. 3,000+ XML Text Corpora 2018 Maciej Ogrodniczuk
Polish Summaries Corpus (PSC) 05.26.20 Polish Dataset contains news articles and their summaries. 723 TSV Summarization 2014 Ogrodniczuk et al.
Polusa 06.25.20 English Dataset contains 0.9M articles covering policy topics published between Jan. 2017 and Aug. 2019 by 18 news outlets representing the political spectrum. 0.9M n/a Classification 2020 Gebhard et al.
Portuguese Newswire Corpus 01.21.20 Portuguese (Brazil) Dataset contains newswire articles collected between 1994 and 2016. Requires preprocessing of the HTML pages found in the GitHub repository at the download link. n/a HTML Text Corpora 2016 Boğaziçi University
Portuguese SQuAD v1.1 01.21.20 Portuguese Portuguese translation of the SQuAD dataset. The translation was performed using the Google Cloud API. ~100,000 JSON Question Answering, Reading Comprehension 2019 Carvalho et al.
Post-Modifier Dataset (PoMo) 05.26.20 English Dataset for developing post-modifier generation systems. It's a collection of sentences that contain entity post-modifiers, along with a collection of facts about the entities obtained from Wikidata. 231,057 PM, WIKI Post-Modifier Generation 2019 Kang et al.
ProPara Dataset 01.15.20 English Dataset is used for comprehension of simple paragraphs describing processes, e.g., photosynthesis. The comprehension task relies on predicting, tracking, and answering questions about how entities change during the process. 488 Google Sheets Question Answering, Reading Comprehension 2018 Mishra et al.
PubMed 200k RCT Dataset 05.04.20 English Dataset is based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. 200,000 Text Classification 2017 Dernoncourt et al.
QA-SRL Bank 01.29.20 English Dataset contains question-answer pairs for 64,000 sentences. Dataset is used to train models for semantic role labeling. 64,000 JSON Question Answering, Semantic Role Labeling 2018 FitzGerald et al.
QA-ZRE 01.29.20 English Dataset contains question-answer pairs, with each instance containing a relation, a question, a sentence, and an answer set. 30M Text Question Answering, Relation Extraction 2017 Levy et al.
QASC 02.06.20 English QASC is a question-answering dataset with a focus on sentence composition. It consists of 9,980 8-way multiple-choice questions about grade school science (8,134 train, 926 dev, 920 test), and comes with a corpus of 17M sentences. 9,980 JSON Question Answering, Reading Comprehension 2020 Khot et al.
QuaRTz Dataset 01.15.20 English Dataset contains 3,864 questions about open domain qualitative relationships. Each question is paired with one of 405 different background sentences (sometimes short paragraphs). 3,864 JSON Question Answering, Reading Comprehension 2019 Tafjord et al.
QuaRel Dataset 01.15.20 English Dataset contains 2,771 story questions about qualitative relationships. 2,771 JSON Question Answering, Reading Comprehension 2018 Tafjord et al.
Quasar-S & T 01.15.20 English The Quasar-S dataset consists of 37,000 cloze-style queries constructed from definitions of software entity tags on the popular website Stack Overflow. The Quasar-T dataset consists of 43,000 open-domain trivia questions and their answers obtained from various internet sources. 80,000 JSON Question Answering, Reading Comprehension 2017 Dhingra et al.
Quda 05.26.20 English Dataset contains 14,035 diverse user queries annotated with 10 low-level analytic tasks that assist in the deployment of state-of-the-art machine/deep learning techniques for parsing complex human language. 14,035 Text Information Extraction, Visualization 2020 Fu et al.
Question Answering in Context (QuAC) 01.15.20 English Dataset for modeling, understanding, and participating in information seeking dialog. 14,000 JSON Question Answering, Reading Comprehension 2018 Choi et al.
Question NLI 01.15.20 English Dataset converts SQuAD dataset into sentence pair classification by forming a pair between each question and each sentence in the corresponding context. 110,000 JSON Natural Language Inference (NLI) 2018 Rajpurkar et al.
Quora Question Pairs 01.15.20 English The task is to determine whether a pair of questions are semantically equivalent. 400,000 TSV Semantic Textual Similarity 2017 Quora
Quoref 02.06.20 English Dataset which tests the coreferential reasoning capability of reading comprehension systems. In this span-selection benchmark containing 24K questions over 4.7K paragraphs from Wikipedia, a system must resolve hard coreferences before selecting the appropriate span(s) in the paragraphs for answering questions. 24,000 JSON Question Answering, Reading Comprehension 2019 Dasigi et al.
ReAding Comprehension Dataset From Examinations (RACE) 01.15.20 English Dataset was collected from the English exams evaluating the students' ability in understanding and reasoning. 28,000 JSON Question Answering, Reading Comprehension 2017 Lai et al.
ReClor 05.04.20 English Dataset contains logical reasoning questions of standardized graduate admission examinations. 6,138 n/a Reading Comprehension 2020 Yu et al.
ReVerb45k, Base and Ambiguous 01.29.20 English 3 Datasets. In total, there are 91K triples. 91,000 JSON Information Retrieval, Knowledge Base 2018 Vashishth et al.
Reading Comprehension over Multiple Sentences (MultiRC) 01.15.20 English Dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph. ~10,000 JSON Question Answering, Reading Comprehension 2018 Khashabi et al.
Reading Comprehension with Commonsense Reasoning Dataset (Record) 01.15.20 English Reading comprehension dataset which requires commonsense reasoning. Contains 120,000+ queries from 70,000+ news articles. 70,000+ JSON Question Answering, Reading Comprehension 2018 Zhang et al.
Reading Comprehension with Multiple Hops (Qangaroo) 01.15.20 English Reading Comprehension datasets focussing on multi-hop (alias multi-step) inference. There are 2 datasets: WikiHop (based on Wikipedia) and MedHop (based on PubMed research papers). ~53,000 JSON Question Answering, Reading Comprehension 2018 Welbl et al.
Recognizing Textual Entailment (RTE) 01.15.20 English Datasets are combined and converted to two-class classification: entailment and not_entailment. n/a JSON Entailment 2006-2009 Dagan et al., Bar Haim et al., Giampiccolo et al., and Bentivogli et al.
Reddit All Comments Corpus 01.15.20 English All Reddit comments (as of 2017). 3,329,219,008 JSON Text Corpora 2017 Reddit
Relation Extraction Corpus 01.21.20 English A human-judged dataset of two relations involving public figures on Wikipedia: about 10,000 examples of "place of birth" and 40,000 examples of "attended or graduated from an institution." 10,000 JSON Relation Extraction 2013 Google
Relationship and Entity Extraction Evaluation Dataset (RE3D) 01.15.20 English Entity and Relation marked data from various news and government sources. n/a JSON Classification, Entity and Relation Recognition 2017 Dstl
Restaurants 02.16.20 English Dataset contains user questions about restaurants, their food types, and locations. 378 JSON Semantic Parsing, Text-to-SQL 2012 Tang/Popescu/
Reuters-21578 Benchmark Corpus 01.15.20 English Dataset is a collection of 10,788 documents from the Reuters financial newswire service, partitioned into a training set with 7769 documents and a test set with 3019 documents. 10,788 TSV Classification 1997 Lewis et al.
Rotowire and SBNation Datasets 05.26.20 English Dataset consists of (human-written) NBA basketball game summaries aligned with their corresponding box and line scores. ~15,000 JSON Data-to-Text 2017 Wiseman et al.
RuBQ 06.25.20 Russian Dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, as well as a Wikidata sample of triples containing entities with Russian labels. 1,500 JSON Question Answering, Knowledge Base 2020 Korablinov et al.
SAMSum 06.25.20 English Dataset contains over 16K chat dialogues with manually annotated summaries. 16,000 JSON Summarization 2019 Gliwa et al.
SCITLDR 05.26.20 English Dataset of a combination of TLDRs written by human experts and author written TLDRs of computer science papers from OpenReview. 3,900 JSON Summarization 2020 Cachola et al.
SMS Spam Collection Dataset 01.15.20 English Dataset contains SMS spam messages. 5,574 Text Classification 2011 Almeida et al.
SNAP Social Circles: Twitter Database 01.15.20 English Large Twitter network data. Nodes: 81,306, Edges: 1,768,149 Text Clustering, Graph Analysis 2012 McAuley et al.
SQuAD v2.0 01.15.20 English Paragraphs w/ questions and answers. 150,000 JSON Question Answering, Reading Comprehension 2018 Rajpurkar et al.
SQuAD-it 03.05.20 Italian The dataset contains more than 60,000 question/answer pairs in Italian derived from the original English SQuAD dataset. 60,000+ JSON Question Answering, Reading Comprehension 2018 Croce et al.
Saudi Newspapers Corpus 01.15.20 Arabic Dataset contains 31,030 Arabic newspaper articles. 31,030 JSON Text Corpora 2015 Alhagri
SberQuAD 03.05.20 Russian Dataset consists of question-answer pairs modeled after SQuAD. 50,364 CSV Question Answering, Reading Comprehension 2019 Efimov et al.
Schema-Guided Dialogue State Tracking (DSTC 8) 01.15.20 English Dataset contains 18K dialogues between a virtual assistant and a user. ~18,000 JSON Dialogue State Tracking 2019 Rastogi et al.
Scholar 02.16.20 English User questions about academic publications, with automatically generated SQL that was checked by asking the user if the output was correct. 817 JSON Semantic Parsing, Text-to-SQL 2017 Iyer et al.
SciCite 06.25.20 English Dataset used for classifying citation intents in academic papers. The main citation intent label for each JSON object is specified with the label key, while the citation context is specified with the context key. 11,020 JSON Classification 2019 Cohan et al.
SciQ Dataset 01.15.20 English Dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. 13,679 JSON Question Answering, Reading Comprehension 2017 Welbl et al.
SciREX 05.26.20 English Dataset is fully annotated with entities, their mentions, their coreferences, and their document level relations. 438 JSON Information Extraction 2020 Jain et al.
SciTail Dataset 01.15.20 English Dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis. 27,026 SNLI, TSV, DGEM Entailment 2018 Khot et al.
ScienceExamCER 06.25.20 English Dataset contains 133k mentions in the science exam domain where nearly all (96%) of content words have been annotated with one or more fine-grained semantic class labels including taxonomic groups, meronym groups, verb/action groups, properties and values, and synonyms. 133,000 Text, TSV Named Entity Recognition (NER) 2019 Smith et al.
SearchQA 01.15.20 English Dataset from Jeopardy! archives which consists of more than 140k question-answer pairs, with each pair having 49.6 snippets on average. 140,000 JSON Question Answering, Reading Comprehension 2017 Dunn et al.
SegmentedTables & LinkedResults 06.25.20 English Dataset annotates dataset mentions in captions, the type of table (leaderboard, ablation, irrelevant), and ground truth cell annotations into classes: dataset, metric, paper model, cited model, meta and task. ~2,000 JSON Table Segmentation, Table Type Classification 2020 Paperswithcode
SelQA 05.04.20 English Dataset provides crowdsourced annotation for two selection-based question answering tasks, answer sentence selection and answer triggering. The dataset comprises about 8K factoid questions for the top-10 most prevalent topics among Wikipedia articles. 8,000 JSON, TSV Question Answering, Reading Comprehension 2016 Jurczyk et al.
Self-Annotated Reddit Corpus (SARC) 05.26.20 English Dataset contains 1.3 million sarcastic comments from the Internet commentary website Reddit. It contains statements, along with their responses as well as many non-sarcastic comments from the same source. 1.3M CSV Text Corpora, Sarcasm Detection 2017 Khodak et al.
SemEval-2014 Task 3 05.04.20 English Dataset is used for cross-level semantic similarity which measures the degree to which the meaning of a larger linguistic item, such as a paragraph, is captured by a smaller item, such as a sentence. 2,000 TSV Semantic Textual Similarity 2014 Jurgens et al.
SemEval-2016 Task 4 03.29.20 English Dataset contains 5 subtasks involving the sentiment analysis of tweets. ~75,000 Text Classification, Sentiment Analysis 2016 Nakov et al.
SemEval-2019 Task 6 05.04.20 English Dataset contains tweets labeled as either offensive or not offensive (Sub-task A) and further classifies offensive tweets into categories (Sub-tasks B – C). 14,100 TSV Classification 2019 Zampieri et al.
SemEval-2019 Task 9 - Subtask A 02.06.20 English Suggestion Mining from Online Reviews and Forums: Dataset contains corpora of unstructured text with the intent for mining it for suggestions. ~6,300 CSV Suggestion Mining 2019 Negi et al.
SemEval-2019 Task 9 - Subtask B 02.06.20 English Suggestion Mining from Hotel Reviews: Dataset contains corpora of unstructured text with the intent for mining it for suggestions. ~800 CSV Suggestion Mining 2019 Negi et al.
SemEvalCQA 01.15.20 Arabic, English Dataset for community question answering. n/a XML Question Answering, Reading Comprehension 2016 Nakov et al.
Semantic Parsing in Context (SParC) 01.15.20 English Dataset consists of 4,298 coherent question sequences (12k+ unique individual questions annotated with SQL queries). It is the context-dependent/multi-turn version of the Spider task. 4,298 JSON, SQL Semantic Parsing, Text-to-SQL 2019 Yu et al.
Semantic Textual Similarity Benchmark 01.15.20 English The task is to predict textual similarity between sentence pairs. 8,628 CSV Semantic Textual Similarity 2017 Cer et al.
Sentences Involving Compositional Knowledge (SICK) 02.06.20 English Dataset contains sentence pairs, generated from two existing sets: the 8K ImageFlickr data set and the SemEval 2012 STS MSR-Video Description. ~10,000 Text Semantic Textual Similarity, Entailment 2014 Marelli et al.
Sentiment Corpus of App Reviews with Fine-grained Annotations in German (SCARE) 01.29.20 German Dataset consists of fine-grained annotations for mobile application reviews from the Google Play Store. For each user review the mentioned application aspects, i.e., the design or the usability, as well as subjective phrases, which evaluate these aspects, are annotated. In addition, the polarity (positive, negative or neutral) of each subjective phrase is recorded as well as the relationship of an aspect to the main app in discussion. Requires emailing source for password to retrieve data. 800,000 CSV Classification, Sentiment Analysis 2016 Sänger et al.
Sentiment Labeled Sentences Dataset 01.15.20 English Dataset contains 3000 sentiment labeled sentences. 3,000 Text Classification, Sentiment Analysis 2015 Kotzias
Sentiment140 01.15.20 English Tweet data from 2009 including original text, time stamp, user and sentiment. 1,578,627 CSV Classification, Sentiment Analysis 2009 Go et al.
Sequential Question Answering (SQA) 05.04.20 English Dataset was created to explore the task of answering sequences of inter-related questions on HTML tables. It has 6,066 sequences with 17,553 questions in total. 17,553 TSV Question Answering, Semantic Parsing 2016 Iyyer et al.
Shaping Answers with Rules through Conversation (ShARC) 01.15.20 English ShARC is a Conversational Question Answering dataset focussing on question answering from texts containing rules. 32,000 JSON Question Answering, Reading Comprehension 2018 Saeidi et al.
SherLIiC 05.04.20 English Dataset contains manually annotated inference rule candidates (InfCands), accompanied by ~960k unlabeled InfCands, and ~190k typed textual relations between Freebase entities extracted from the large entity-linked corpus ClueWeb09. ~960,000 Text Natural Language Inference (NLI), Lexical Inference/Entailment 2019 Schmitt et al.
Short Answer Scoring 01.15.20 English Student-written short-answer responses. n/a TSV Scoring Classification 2012 The Hewlett Foundation
Simplified Versions of the CommAI Navigation tasks (SCAN) 01.29.20 English Dataset used for studying compositional learning and zero-shot generalization. SCAN consists of a set of commands and their corresponding action sequences. 20,000+ Text Compositional Learning 2018 Lake et al.
Situations With Adversarial Generations (SWAG) 01.15.20 English Dataset consists of 113k multiple choice questions about grounded situations. Each question is a video caption from LSMDC or ActivityNet Captions, with four answer choices about what might happen next in the scene. 113,000 CSV Question Answering, Reading Comprehension 2018 Zellers et al.
Skytrax User Reviews Dataset 01.15.20 English User reviews of airlines, airports, seats, and lounges from Skytrax. 41,396 CSV Classification, Sentiment Analysis 2015 Nguyen
Soccer Dialogues 01.21.20 English Dataset contains soccer dialogues over a knowledge graph 2,890 JSON Knowledge Graphs, Dialogue 2019 SDA Lab, Uni. Of Bonn & Volkswagen Research
Social IQA 01.29.20 English Dataset is a question-answering benchmark for testing social commonsense intelligence. 37,000+ JSON Question Answering, Commonsense 2019 Sap et al.
Social Media Mining for Health (SMM4H) 01.21.20 English Dataset contains medication-related text classification and concept normalization from Twitter 25,678 Text Classification 2018 Sarker et al.
Social-IQ Dataset 01.15.20 English Dataset containing videos and natural language questions for visual reasoning. 7,500 n/a Question Answering, Visual, Commonsense 2019 Zadeh et al.
Some Like it Hoax 05.26.20 English Dataset contains 15,500 posts from 32 pages (14 conspiracy and 18 scientific). 15,500 JSON Classification, Fake News Detection 2017 Tacchini et al.
Spambase Dataset 01.15.20 English Dataset contains spam emails. 4,601 Text Classification 1999 Hopkins et al.
Spider 1.0 01.15.20 English Dataset consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains. 10,181 JSON, SQL Semantic Parsing, Text-to-SQL 2018 Yu et al.
Stack Overflow BigQuery Dataset 01.15.20 English BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. n/a n/a Text Corpora 2018 Stack Overflow
Stanford Natural Language Inference (SNLI) Corpus 01.15.20 English Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs. 570,000 Text Natural Language Inference (NLI) 2015 Bowman et al.
Statutory Reasoning Assessment (SARA) 05.26.20 English Dataset contains a set of rules extracted from the statutes of the US Internal Revenue Code (IRC), together with a set of natural language questions which may only be answered correctly by referring to the rules. 100 Text Text Corpora 2020 Holzenberger et al.
Street View Text (SVT) 05.26.20 English Dataset contains images with textual content used for scene text recognition. n/a XML, JPG Multi-Modal Learning, Scene Text Recognition 2012 Wang et al.
Surrey Audio-Visual Expressed Emotion (SAVEE) 06.25.20 English Dataset consists of recordings from 4 male actors in 7 different emotions, 480 British English utterances in total. The sentences were chosen from the standard TIMIT corpus and phonetically-balanced for each emotion. 480 Wav Emotion Recognition, Audio 2007 Vlasenko et al.
Switchboard Dialogue Act Corpus (SwDA) 01.21.20 English A subset of the Switchboard-1 corpus consisting of 1,155 conversations and 42 tags 1,155 UTT Dialogue Act Classification 1997 Bates et al.
T-REx 01.15.20 English Dataset contains Wikipedia abstracts aligned with Wikidata entities. 11M aligned triples JSON and NIF Relation Extraction 2018 Elsahar et al.
T4SS Event Corpus 05.26.20 English Dataset contains 27 full text publications totaling 15,143 pseudo-sentences (text sentences plus table rows, references, etc.) and 244,942 tokens covering 4 classes: Bacteria, Cellular components, Biological Processes, and Molecular functions. 27 Text Information Extraction, Event Extraction 2010 Pyysalo et al.
TGIF-QA 06.25.20 English Dataset consists of 165K QA pairs from 72K animated GIFs. Used for video question answering. 165,000 CSV Multi-Modal Learning, Video Question Answering 2017 Jang et al.
TVQA 06.25.20 English Dataset is used for video question answering and consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. 460+ Hours HDF5, JSON Multi-Modal Learning, Video Question Answering 2018 Lei et al.
TabFact 01.15.20 English Dataset contains 16k Wikipedia tables as evidence for 118k human annotated statements to study fact verification with semi-structured evidence. 16,000 JSON Natural Language Inference (NLI) 2020 Chen et al.
Talk the Walk 06.25.20 English Dataset consists of over 10k crowd-sourced dialogues in which two human annotators collaborate to navigate to target locations in the virtual streets of NYC. 10,000+ JSON, JPG Dialogue, Grounded Language Learning 2018 Vries et al.
Tanzil 06.25.20 Multi-Lingual Dataset is a collection of Quran translations in 42 languages. 1.01M XML Machine Translation 2012 Tiedemann et al.
Taskmaster-2 03.29.20 English Dataset consists of 17,289 dialogs in seven domains: restaurants (3276), food ordering (1050), movies (3047), hotels (2355), flights (2481), music (1602), and sports (3478). It consists entirely of spoken two-person dialogs. 17,289 JSON Dialogue 2020 Byrne et al.
Taskmaster-1 01.15.20 English Dataset contains 13,215 task-based dialogs, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations. 13,215 JSON Dialogue 2019 Byrne et al.
Tatoeba 06.25.20 Multi-Lingual Dataset is a collection of sentences and translations. 8.5M CSV Machine Translation 2020 Tatoeba
Ten Thousand German News Articles Dataset (10kGNAD) 01.15.20 German Dataset consists of 10,273 German-language news articles from an Austrian online newspaper, categorized into nine topics. 10,273 CSV Text Corpora 2019 Timo Block
Tencent AI Lab Embedding Corpus 01.15.20 Chinese Dataset provides 200-dimension vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases. 8M Text Embeddings 2018 Song et al.
TextVQA 01.15.20 English TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions. 36,602 JSON, PNG Question Answering, Visual, Commonsense 2019 Singh et al.
Textbook Question Answering 01.15.20 English The M3C task builds on the popular Visual Question Answering (VQA) and Machine Comprehension (MC) paradigms by framing question answering as a machine comprehension task, where the context needed to answer questions is provided and composed of both text and images. 26,620 JSON, PNG Question Answering, Reading Comprehension, Visual 2017 Kembhavi et al.
Textual Visual Semantic Dataset 05.04.20 English Dataset for detecting and recognizing text appearing in images (e.g. signboards, traffic signals, or brands on clothing or objects). Around 82,000 images. 82,000 JPG, CSV Automatic Image Captioning 2020 Sabir et al.
The Arabic Parallel Gender Corpus 03.29.20 Arabic Dataset is designed to support research on gender bias in natural language processing applications working on Arabic. Requires submitting an application for approval. ~12,000 n/a Gender Identification 2019 Habash et al.
The Benchmark of Linguistic Minimal Pairs (BLiMP) 01.15.20 English BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. 67 sub-datasets each with 1,000 minimal pairs JSON Language Modeling 2019 Warstadt et al.
The Conversational Intelligence Challenge 2 (ConvAI2) 01.15.20 English A chit-chat dataset based on PersonaChat dataset. 3,127 JSON Dialogue 2018 NeurIPS
The Corpus of Linguistic Acceptability (CoLA) 01.15.20 English Dataset used to classify sentences as grammatical or not grammatical. 10,657 TSV Grammatical Acceptability 2018 Warstadt et al.
The Cross-lingual Natural Language Inference corpus (XNLI) 03.29.20 Multi-Lingual Dataset contains a collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. 112,500 JSON, Text Entailment 2018 Conneau et al.
The Dialog-based Language Learning Dataset 01.15.20 English Dataset was designed to measure how well models can perform at learning as a student given a teacher’s textual responses to the student’s answer. n/a Text Question Answering, Reading Comprehension 2016 Weston
The EUR-Lex Dataset 05.04.20 Multi-Lingual Dataset is a collection of documents about European Union law. It contains many different types of documents, including treaties, legislation, case-law and legislative proposals, which are indexed with almost 4,000 labels. n/a HTML Classification 2010 Mencía et al.
The Emotion in Text 01.21.20 English Dataset of tweets labelled with emotion. Categories: empty, sadness, enthusiasm, neutral, worry, love, fun, hate, happiness, relief, boredom, surprise, anger. 40,000 CSV Emotion Classification 2016 CrowdFlower
The Irish Times IRS 01.15.20 English Dataset contains 23 years of events from Ireland. 1,425,460 CSV Clustering, Events, Language Detection 2018 Rohit Kulkarni
The Movie Dialog Dataset 01.15.20 English Dataset measures how well models can perform at goal and non-goal orientated dialogue centered around the topic of movies (question answering, recommendation and discussion). ~3.5M Text Question Answering, Reading Comprehension 2016 Dodge et al.
The New York Times Annotated Corpus 05.26.20 English Dataset contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom. 1.8M XML Summarization, Information Extraction 2008 Sandhaus et al.
The NewsReader MEANTIME Corpus 02.16.20 Multi-Lingual 480 news articles: 120 English Wikinews articles on four topics (i.e. Airbus and Boeing, Apple Inc., Stock market, and General Motors, Chrysler and Ford) and their translations in Spanish, Italian, and Dutch. Annotated with entities, events, temporal, semantic roles and event/entity coreference. 480 XML, NAF Named Entity Recognition (NER) 2016 Minard et al.
The Penn Treebank Project 01.15.20 English Naturally occurring text annotated for linguistic structure. ~1M words Text POS 1995 Marcus et al.
The SimpleQuestions Dataset 01.15.20 English Dataset for question answering with human generated questions paired with a corresponding fact, formatted as (subject, relationship, object), that provides the answer but also a complete explanation. 108,442 Text Question Answering, Reading Comprehension 2015 Bordes et al.
The Stanford Sentiment Treebank (SST) 01.15.20 English Sentence sentiment classification of movie reviews. 69,000 PTB Classification, Sentiment Analysis 2013 Socher et al.
The Story Cloze Test | ROCStories 01.15.20 English Dataset for story understanding that provides systems with four-sentence stories and two possible endings. The systems must then choose the correct ending to the story. 100,000+ JSON Question Answering, Reading Comprehension 2017 Mostafazadeh et al.
The TAC Relation Extraction Dataset (TACRED) 03.29.20 English A relation extraction dataset containing 106k+ examples covering 42 TAC KBP relation types. Costs $25 for non-members. 106,264 CoNLL, JSON Relation Extraction 2017 Yuhao et al.
The WikiMovies Dataset 01.15.20 English Dataset contains only the QA part of the Movie Dialog dataset, but using three different settings of knowledge: using a traditional knowledge base (KB), using Wikipedia as the source of knowledge, or using IE (information extraction) over Wikipedia. ~100,000 Text Question Answering, Reading Comprehension 2016 Miller et al.
The Winograd Schema Challenge 01.15.20 Multi-Lingual Dataset to determine the correct referent of the pronoun from among the provided choices. 150 XML Coreference Resolution 2012 Levesque et al.
ToTTo 05.04.20 English Dataset is used for the controlled generation of descriptions of tabular data comprising over 100,000 examples. Each example is an aligned pair of a highlighted table and the description of the highlighted content. 120,000+ JSON Table-to-Text 2020 Parikh et al.
Topical-Chat 01.15.20 English A knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles. 10,784 JSON Dialogue 2019 Gopalakrishnan et al.
Total-Text-Dataset 01.15.20 English Dataset used to classify curved text in pictures. ~1,500 JPG Scene Text Detection 2019 Ch'ng et al.
Train-O-Matic Large 03.29.20 Multi-Lingual Automatically-generated corpora in multiple languages with sense annotations for nouns using WordNet for English and BabelNet for all other languages as inventories of senses. 10M+ XML Word Sense Disambiguation 2018 Pasini et al.
Train-O-Matic Small 03.29.20 Multi-Lingual Automatically-generated corpora in multiple languages with sense annotations for nouns using WordNet for English and BabelNet for all other languages as inventories of senses. 1M+ XML Word Sense Disambiguation 2017 Pasini et al.
Translation-Augmented-LibriSpeech-Corpus (Libri-Trans) 03.05.20 English, French Dataset is an augmentation of LibriSpeech ASR and contains English utterances (from audiobooks) automatically aligned with French text. It offers ~236h of speech aligned to translated text. ~236 Hours Text, WAV Speech Translation 2018 Kocabiyikoglu et al.
Trec CAR Dataset 02.16.20 English Dataset contains topics, outlines, and paragraphs that are extracted from English Wikipedia (2016 XML dump). Wikipedia articles are split into the outline of sections and the contained paragraphs. ~285,000 CBOR Information Retrieval 2019 Dietz et al.
TrecQA 01.15.20 English Dataset is commonly used for evaluating answer selection in question answering. n/a XML Question Answering, Reading Comprehension 2007 Wang et al.
TriviaQA 01.15.20 English Dataset containing over 650K question-answer-evidence triples. It includes 95K QA pairs authored by trivia enthusiasts and independently gathered evidence documents, 6 per question on average. 650,000+ JSON Question Answering, Reading Comprehension 2017 Joshi et al.
Tumblr GIF (TGIF) 06.25.20 English Dataset contains 100K animated GIFs and 120K sentences describing visual content of the animated GIFs. 100,000 TSV Image Description Generation 2016 Li et al.
TupleInf Open IE Dataset 01.15.20 English Dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in "Answering Complex Questions Using Open Information Extraction" (referred as Tuple KB, T). 263,000 Text Knowledge Base 2017 Allen Institute
Twenty Newsgroups Dataset 01.15.20 English Messages from 20 different newsgroups. 20,000 Text Classification, Clustering 1999 Mitchell et al.
Twitter Chat Corpus 01.29.20 English Dataset contains Twitter question-answer pairs. 5M Text Dialogue 2017 Marsan Ma
Twitter Dataset for Arabic Sentiment Analysis 01.15.20 Arabic Dataset contains Arabic tweets. 2,000 Text Classification, Sentiment Analysis 2014 Abdulla
Twitter US Airline Sentiment 01.15.20 English Contributors were asked to classify positive, negative, and neutral tweets, followed by categorizing negative reasons. 14,500 CSV Classification, Sentiment Analysis 2016 Figure Eight
Twitter100k 01.15.20 English Pairs of images and tweets. 100,000 Text and Images Multi-Modal Learning 2017 Hu et al.
TyDi QA 02.06.20 Multi-Lingual TyDi QA includes question-answer pairs from 11 languages: Arabic, Bengali, English, Finnish, Indonesian, Kiswahili, Russian, Japanese, Korean, Thai, and Telugu. 204,000 JSON Question Answering, Reading Comprehension 2020 Clark et al.
Ubuntu Dialogue Corpus 01.15.20 English Dialogues extracted from Ubuntu chat stream on IRC. 930,000 CSV Text Corpora, Dialogue 2015 Lowe et al.
United Nations Parallel Corpus 02.06.20 Multi-Lingual Parallel corpus consists of manually translated UN documents from a 25-year period (1990 to 2014) for the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish. 799,276 TEI, XML Machine Translation 2016 Ziemski et al.
Urban Dictionary Dataset 01.15.20 English Corpus of words, votes and definitions. 2,606,522 CSV Reading Comprehension 2016-05 Anonymous
UrbanSound & UrbanSound8K 06.25.20 n/a UrbanSound: Dataset contains 1,302 labeled sound recordings. Each recording is labeled with the start and end times of sound events from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. UrbanSound8K: Dataset contains 8,732 labeled sound excerpts (<=4s) of urban sounds from the same 10 classes. 10,034 Wav, JSON, CSV Acoustic Classification 2014 Salamon et al.
Urhobo Text 05.26.20 Urhobo, English Dataset is a parallel dataset containing 10.3M tokens. n/a Text, XML Text Corpora, Machine Translation 2019 Ruoho Ruotsi
UseNet Corpus 01.15.20 English UseNet forum postings. 7B Text Dialogue 2011 Shaoul et al.
VIdeO-and-Language INference (VIOLIN) 05.04.20 English Dataset contains 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video (YouTube and TV shows). Inference descriptions of video content were annotated. Inferences are used to measure entailment vs video clip. 15,887 JSON, H5 Multi-Modal Learning 2020 Liu et al.
VQA-Introspect 06.25.20 English Dataset consists of 238K new perception questions from the VQA dataset which serve as sub questions corresponding to the set of perceptual tasks needed to answer complex reasoning questions. 238,000 JSON Question Answering, Visual 2020 Selvaraju et al.
Visual Commonsense Reasoning (VCR) 01.15.20 English Dataset contains 290K multiple-choice questions on 110K images. 290,000 JSON, JPG Question Answering, Visual, Commonsense 2018 Zellers et al.
VisDial 01.29.20 English Dataset contains images from the COCO training set, paired with dialogues. Intended for training models to answer questions about images during a conversation. Contains 1.2M dialog question-answer pairs. 1.2M JSON Question Answering, Visual, Dialogue 2017 Das et al.
Visual QA (VQA) 01.15.20 English Dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense to answer. 265,016 images JSON Visual Question Answering 2015 Antol et al.
Visual Storytelling Dataset (VIST) 05.04.20 English Dataset contains 81,743 unique photos in 20,211 sequences, aligned to descriptive and story language. VIST was previously known as the Sequential Image Narrative Dataset (SIND). 81,743 JSON Multi-Modal Learning 2016 Huang et al.
Voices Obscured in Complex Environmental Settings (VOiCES) 01.15.20 English Dataset contains a total of 15 hours (3,903 audio files) in male and female read speech. n/a Wav Speech Recognition 2018 Various
VoxCeleb 01.15.20 Multi-Lingual An audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. n/a MD5, URL Speech Recognition, Visual 2017 Nagrani et al.
VoxClamantis 06.25.20 Multi-Lingual Dataset contains phoneme-level alignments for more than 600 languages, high-resource alignments for ~50 languages, and phonetic measures for all vowels and sibilants. Consists of 690 audio readings of the New Testament of the Bible. 690 CSV Phonetic Typology 2020 Salesky et al.
VoxForge 06.25.20 Multi-Lingual Dataset consisting of speech audio clips submitted by the community involving several different languages. Dataset is constantly updated. n/a Wav, MFC Speech Corpora 2020 VoxForge
W-NUT 2017 05.04.20 English Dataset contains tweets, Reddit comments, YouTube comments, and StackExchange posts annotated with 6 entity types: Person, Location, Corporation, Consumer good, Creative work, and Group. 2,295 CoNLL Named Entity Recognition (NER) 2017 Derczynski et al.
WAT 2019 Hindi-English 03.29.20 Hindi, English Dataset is for multimodal English-to-Hindi translation. The input is an image, a rectangular region in the image, and an English caption; the output is a caption in Hindi. 32,925 Text, JPEG Machine Translation, Multi-Modal Learning 2019 Parida et al.
WMT 14 English-German 01.15.20 Multi-Lingual Sentence pairs for translation. 4.5M Text Machine Translation 2015 Stanford
WMT 15 English-Czech 01.15.20 Multi-Lingual Sentence pairs for translation. 15.8M Text Machine Translation 2016 Stanford
WMT 19 Multiple Datasets 01.15.20 Multi-Lingual Multiple text corpora in multiple languages. n/a Text Text Corpora, Machine Translation 2019 ACL Workshop
WN18RR 05.26.20 English Dataset contains knowledge base relation triples from WordNet. 11 relations, 40,943 entities Text Relation Prediction 2018 Dettmers
WSD English All-Words Fine-Grained Datasets 03.29.20 English Unified five standard all-words Word Sense Disambiguation datasets. 7,000+ XML Word Sense Disambiguation 2017 Raganato et al.
WSJ0 Hipster Ambient Mixtures (WHAM!) 06.25.20 English Dataset consists of two-speaker mixtures from the wsj0-2mix dataset combined with real ambient noise samples. The samples were collected in coffee shops, restaurants, and bars in the San Francisco Bay Area. 81 Hours Wav Speech Separation 2019 Wichern et al.
Watan-2004 Corpus 03.29.20 Arabic Dataset contains about 20,000 articles talking about 6 topics: culture, religion, economy, local news, international news and sports. 20,000 HTML Text Corpora 2004 Abbas et al.
Web of Science Dataset 01.15.20 English Hierarchical Datasets for Text Classification. 46,985 Text Classification 2017 Kowsari et al.
WebNLG (Enriched) 05.26.20 German, English Dataset consists of 25,298 (data,text) pairs and 9,674 distinct data units. The data units are sets of RDF triples extracted from DBPedia and the texts are sequences of one or more sentences verbalising these data units. 25,298 XML Text Generation 2017 Gardent et al.
WebQuestions 05.04.20 English Dataset contains 6,642 question/answer pairs. The questions are supposed to be answerable by Freebase, a large knowledge graph. The questions are mostly centered around a single named entity. 6,642 JSON Question Answering, Knowledge Base 2013 Berant et al.
WebQuestions Semantic Parses Dataset 01.15.20 English Dataset contains full semantic parses in SPARQL queries for 4,737 questions, and “partial” annotations for the remaining 1,073 questions for which a valid parse could not be formulated or where the question itself is bad or needs a descriptive answer. 5,810 JSON Semantic Parsing 2016 Yih et al.
Webis-CLS-10 03.29.20 Multi-Lingual The Cross-Lingual Sentiment (CLS) dataset comprises about 800,000 Amazon product reviews in the 4 languages: English, German, French, and Japanese. 800,000 Tar Classification, Sentiment Analysis 2010 Prettenhofer et al.
Webis-Snippet-20 Corpus 03.29.20 English Dataset comprises four abstractive snippet datasets from ClueWeb09, ClueWeb12, and DMOZ descriptions. More than 10 million <webpage, abstractive snippet> pairs / 3.5 million <query, webpage, abstractive snippet> pairs were collected. 3.5M JSON Summarization 2020 Chen et al.
Webis-TLDR-17 Corpus 03.29.20 English Dataset contains 3 million pairs of content and self-written summaries mined from Reddit. It is one of the first large-scale summarization datasets from the social media domain. 3,084,410 JSON Summarization 2017 Volske et al.
Web Inventory of Transcribed and Translated Talks (WIT3) 01.29.20 Multi-Lingual Dataset contains a collection of transcribed and translated talks. The core of the dataset is from the TED Talks corpus. As of 2016, it covers 109 languages. n/a XML Machine Translation 2012 Cettolo et al.
Who Did What Dataset 01.15.20 English Dataset contains over 200,000 fill-in-the-gap (cloze) multiple choice reading comprehension problems constructed from the LDC English Gigaword newswire corpus. 200,000+ XML Question Answering, Reading Comprehension 2016 Onishi et al.
WikiAnn 03.29.20 Multi-Lingual Dataset with NER annotations for PER, ORG and LOC. It has been constructed using the linked entities in Wikipedia pages for 282 different languages. 95,924 JSON Named Entity Recognition (NER) 2017 Pan et al.
WikiBio 05.26.20 English Dataset contains 728,321 biographies from Wikipedia. For each article, it provides the first paragraph and the infobox (both tokenized). 728,321 n/a Data-to-Text 2016 Lebret et al.
WikiHow 01.15.20 English Dataset contains article and summary pairs extracted and constructed from an online knowledge base written by different human authors. 230,000+ Text Text Corpora, Summarization 2018 Koupaee et al.
WikiLinks 01.15.20 English Dataset contains 40 million mentions over 3 million entities based on hyperlinks from Wikipedia. ~10M Text Text Corpora 2012 Singh et al.
WikiMatrix 02.16.20 Multi-Lingual Dataset contains 135 million parallel sentences for 1,620 different language pairs in 85 different languages. 135M TSV Machine Translation 2019 Schwenk et al.
WikiQA Corpus 01.15.20 English Dataset contains Bing query logs as the question source. Each question is linked to a Wikipedia page that potentially has the answer. 3,047 TSV Question Answering, Reading Comprehension 2015 Yang et al.
WikiReading 01.29.20 Multi-Lingual The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles. Includes English, Russian and Turkish. 18M JSON Knowledge Base, NLU 2016 Hewlett & Kenter et al.
WikiSQL 02.16.20 English A large collection of automatically generated questions about individual tables from Wikipedia. 80,654 JSON Semantic Parsing, Text-to-SQL 2017 Zhong et al.
WikiSplit 02.16.20 English Dataset contains 1 million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits. 1M TSV Sentence Simplification 2018 Botha et al.
WikiTablesQuestions 05.04.20 English Dataset is for the task of question answering on semi-structured HTML tables. 22,033 TSV Question Answering, Semantic Parsing 2015 Pasupat et al.
WikiText-103 & 2 02.06.20 English Dataset contains word and character level tokens extracted from Wikipedia 100M+ TOKENS Language Modeling 2016 Merity et al.
WikiText-TL-39 05.26.20 Filipino Dataset is a large scale, unlabeled text dataset with 39M tokens in the training set. n/a Text Text Corpora, Language Modeling 2019 Cruz et al.
Wikidata NE dataset 02.06.20 English, German Dataset has 2 parts: the Named Entity files and the link files. The Named Entity files include the most important information about the entities, whereas the link files contain the links and ids in other databases. n/a JSON Named Entity Recognition, Knowledge Base 2017 Geiß et al.
Wikipedia 02.16.20 English The 2016-12-21 dump of English Wikipedia. 5,075,182 SQL Text Corpora 2016 Facebook Research
Wikipedia Current Events Portal (WCEP) Dataset 06.25.20 English Dataset is used for multi-document summarization (MDS) and consists of short, human-written summaries about news events, obtained from the Wikipedia Current Events Portal (WCEP), each paired with a cluster of news articles associated with an event. 10,200 JSON Summarization 2020 Ghalandari et al.
Wikipedia News Corpus 03.29.20 English Text from Wikipedia's current events page with dates. ~25,000 Text Text Corpora 2019 Parth Parikh
Will-They-Won't-They (WT-WT) 05.26.20 English Dataset of English tweets targeted at stance detection for the rumor verification task. 51,284 JSON Stance Detection 2020 Conforti et al.
WinoGrande 01.29.20 English Formulated as a fill-in-a-blank task with binary options, the goal is to choose the right option for a given sentence which requires commonsense reasoning. 44,000 JSON Commonsense Reasoning 2019 Sakaguchi et al.
Winogender Schemas 01.15.20 English Dataset with pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in automated coreference resolution systems. 720 TSV Coreference Resolution 2018 Rudinger et al.
Wisesight Sentiment Corpus 02.06.20 Thai Dataset contains around 26,700 Thai-language messages from various social media platforms, with human-annotated sentiment classification (positive, neutral, negative, and question). ~26,700 Text Classification, Sentiment Analysis 2019 Wisesight
WordNet 06.25.20 English Dataset is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. n/a n/a Knowledge Base 2006 George Miller
Words in Context 01.15.20 English Dataset for evaluating contextualized word representations. 2,400 Text Word Sense Disambiguation 2019 Pilehvar et al.
Worldtree Corpus 06.25.20 English Dataset contains multi-hop question answering/explanations where questions require combining between 1 and 16 facts (average 6) to generate detailed explanations for question answering inference. Each explanation is represented as a lexically-connected “explanation graph” that combines an average of 6 facts drawn from a semi-structured knowledge base of 9,216 facts across 66 tables. 5,114 Text, TSV Question Answering, Knowledge Base 2020 Xie et al.
Worldwide News - Aggregate of 20K Feeds 01.15.20 Multi-Lingual One week snapshot of all online headlines in 20+ languages. 1,398,431 CSV Clustering, Events, Machine Translation 2017 Rohit Kulkarni
X-Stance 03.29.20 Multi-Lingual Dataset contains more than 150 political questions, and 67k comments written by candidates on those questions. The questions are available in German, French, Italian and English. 67,000 JSON Stance Detection 2020 Vamvas et al.
X-Sum 03.05.20 English The XSum dataset consists of 226,711 Wayback-archived BBC articles (2010 to 2017) covering a wide variety of domains: News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts. 226,711 JSON Summarization 2018 Narayan et al.
XQuAD 03.05.20 Multi-Lingual Dataset consists of a subset of 240 context paragraphs and 1,190 question-answer pairs from the development set of SQuAD v1.1 with their translations in 10 languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. 1,190 JSON Question Answering, Reading Comprehension 2019 Artetxe et al.
Yahoo! Music User Ratings of Musical Artists 01.15.20 English Over 10M ratings of artists by Yahoo users. May be used to validate recommender systems or collaborative filtering algorithms. ~10M Text Clustering, PCA 2004 Yahoo!
Yelp Open Dataset 01.15.20 English Dataset containing millions of reviews on Yelp. In addition it contains business data including location data, attributes, and categories. 6,685,900 JSON Classification, Sentiment Analysis 2015 Yelp
Yelp Polarity Reviews 06.25.20 English Dataset contains 1,569,264 samples from the Yelp Dataset Challenge 2015. This subset has 280,000 training samples and 19,000 test samples in each polarity. Dataset from FastAI's website. 1,569,264 CSV Sentiment Analysis 2015 Zhang et al.
Yoruba Text 05.26.20 Yoruba Multiple datasets scraped together for the Yoruba language. n/a Text Text Corpora, Machine Translation 2018 Ruoho Ruotsi
YouTube Comedy Slam Preference Dataset 01.15.20 English User vote data for pairs of videos shown on YouTube. Users voted on funnier videos. 1,138,562 Text Classification 2012 Google
arXiv Bulk Data 01.15.20 English A collection of research papers on arXiv. n/a Tar Text Corpora 2011 n/a
bAbI 20 Tasks 01.15.20 English, Hindi Dataset contains a set of contexts, with multiple question-answer pairs available based on the contexts. 2,000 Text Question Answering, Reading Comprehension 2015 Weston et al.
bAbI 6 Tasks Dialogue 01.15.20 English Dataset contains 6 tasks for testing end-to-end dialog systems in the restaurant domain. 3,000 Text Dialogue 2017 Bordes et al.
e-SNLI 06.25.20 English Dataset contains human-annotated natural language explanations of the entailment relations. n/a CSV Natural Language Inference (NLI) 2018 Camburu et al.
emrQA 05.26.20 English Dataset contains 1M question-logical form and 400,000+ question answer evidence pairs on electronic medical records. In total, there are 2,495 clinical notes. 2,495 CSV Question Answering, Reading Comprehension 2018 Pampari et al.
mTOR Pathway Corpus 05.26.20 English Dataset contains 1,300 annotated event instances of protein association and dissociation reactions. 1,300 Text Information Extraction, Entity Extraction, Event Extraction 2011 Ohta et al.
