Data and Statistics
Finding and accessing data and statistics across a range of disciplines.
General Data Collections
- DANS: The Dutch National Centre of Expertise and Repository for Research Data (DANS) data archive collection contains datasets in the fields of humanities, archaeology, geospatial sciences, and behavioural and social sciences.
- Digital Humanities Toychest | Data Collections & Datasets: A comprehensive collection of digital humanities datasets and tools, compiled by Dr. Alan Liu of UC Santa Barbara.
- Humanities Data: Humanitiesdata.com seeks to help collect and disseminate information about publicly available data of particular interest to digital humanities and humanities computing. Humanitiesdata.com collects exclusively open datasets.
- Journal of Open Humanities Data: The Journal of Open Humanities Data (JOHD) features peer-reviewed publications describing humanities data or techniques with high potential for reuse. The journal currently publishes two types of papers: short data papers, which contain a concise description of a humanities research object with high reuse potential, and full-length research papers, which discuss and illustrate methods, challenges, and limitations in the creation, collection, management, access, processing, or analysis of data in humanities research, including standards and formats.
- Early English Books Online (EEBO) Text Creation Partnership (TCP): EEBO-TCP is a partnership with ProQuest and more than 150 libraries to generate highly accurate, fully searchable, SGML/XML-encoded texts corresponding to books from the Early English Books Online (EEBO) database.
- Geographic Locations in English-Language Literature, 1701-2011: This dataset contains metadata as well as data regarding geographic locations mentioned in works of fiction from 1701 to 2011 found in the HathiTrust Digital Library. The dataset comes in three versions: volumemeta, recordmeta, and titlemeta. The dataset contains over 30 columns of data for each volume row. Data in the dataset includes the geographic location as it appears in the volume, the number of times the location is mentioned in the volume, and the latitude and longitude for the location.
- Google Books Ngram Dataset: The complete corpus of approximately 5 million books digitized by Google.
- HTRC Extracted Features Dataset: The HTRC Extracted Features Dataset v.2.0 is composed of page-level features for 17.1 million volumes in the HathiTrust Digital Library. This version contains non-consumptive features for both public-domain and in-copyright books. Features include part-of-speech tagged term token counts, header/footer identification, marginal character counts, and much more.
- NLP Datasets: A collection of raw unstructured text data for use in natural language processing applications.
- Project Gutenberg: Project Gutenberg is an online library of free eBooks. Not a text corpus itself, it can be used to generate one for research.
- ProQuest TDM Studio: A text and data mining platform for research at all levels and in all disciplines, for analysis of the JHU library’s ProQuest rights-cleared content.
- Word Frequencies in English-Language Literature, 1700-1922: Genre-specific word counts for 178,381 volumes from the HathiTrust Digital Library. This dataset contains the word frequencies for all English-language volumes of fiction, drama, and poetry in the HathiTrust Digital Library from 1700 to 1922. Word counts are aggregated at the volume level but include only pages tagged as belonging to the relevant literary genre.
- World-Historical Dataverse: The World-Historical Dataverse is published by the World History Center at the University of Pittsburgh. It is intended to contribute to the development and exchange of datasets relevant to world-historical documentation and analysis. Datasets include Modern Data Bank, Japan Historical Statistics, and Slave movements in the 18th and 19th centuries, among many others.
Art and Critical Theory
- The Getty Provenance Index: The Getty Provenance Index® (GPI) provides access to archival inventories, sales catalogs, and dealer stock books.
- Bechdel Test in Film Dataset: The Bechdel test was evaluated on 1,615 films released from 1990 to 2013 to examine the relationship between the prominence of women in a film and that film’s budget and gross profits.