Datasets
Explore the datasets used in our analysis, including their structure, sources, and key features.
The New Yorker Caption Contest (NYCC) dataset contains captions submitted to the weekly contest, along with their funniness scores and metadata.
The NYCC dataset provides a unique institutional perspective on humor. It allows us to analyze humor in a structured, community-driven environment where captions are explicitly rated for funniness. This dataset is particularly useful for:
The dataset spans contest #510 (22 February 2016) to contest #895 (6 May 2024), with weekly updates. Captions are rated by crowdsourced users, and the final winner is decided by The New Yorker editorial staff. Key insights include:
More details can be found in [1].
The Oxford Humor in Context (OHIC) dataset, also referred to as OxfordTVG-HIC, contains humorous image-text pairs with linguistic and social annotations.
The OHIC dataset provides a broader, general-audience perspective on humor. It includes a diverse range of humor types and contexts, making it ideal for:
OxfordTVG-HIC offers approximately 2.9M image-text pairs with humor scores, curated to avoid offensive content. The dataset is particularly useful for:
For more details, see [5] and [6] in the References section of the Sources page.