New Yorker Caption Contest (NYCC)

The New Yorker Caption Contest dataset contains captions submitted to the weekly contest, along with their funniness scores and metadata.

Why NYCC?

The NYCC dataset provides a unique institutional perspective on humor. It allows us to analyze humor in a structured, community-driven environment where captions are explicitly rated for funniness. This dataset is particularly useful for:

  • Understanding the thematic and stylistic preferences of a specific audience.
  • Tracking the evolution of humor over time in a controlled setting.
  • Exploring the relationship between caption content and funniness scores.

Key Insights

The dataset spans contest #510 (22 February 2016) through contest #895 (6 May 2024) and is updated weekly. Captions are rated by crowdsourced users, and The New Yorker editorial staff choose the final winner. Questions the data lets us examine include:

  • How many responses are gathered per contest?
  • How do ratings behave after users have seen many captions?
  • How well can natural language processing (NLP) tools find similar captions?

More details can be found at [1].
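
As a rough illustration of the first and third questions above, the sketch below counts responses per contest and ranks captions by Jaccard token overlap. Jaccard overlap is a simple stand-in chosen here for illustration, not the NLP tooling described in [1], and the `contest` and `caption` field names are assumptions about the schema.

```python
from collections import Counter


def responses_per_contest(rows):
    """Count how many captions were submitted to each contest."""
    return Counter(row["contest"] for row in rows)


def jaccard(a, b):
    """Share of distinct lowercase words that two captions have in common."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def most_similar(query, captions, top_k=3):
    """Return the top_k (score, caption) pairs most similar to the query."""
    scored = [(jaccard(query, c), c) for c in captions if c != query]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
```

Word overlap misses paraphrases that share no vocabulary, so an embedding-based measure would likely be needed for a serious similarity study.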

Preprocessing Steps

  1. Loaded the dataset and removed unnecessary columns.
  2. Standardized column formats (e.g., lowercase text, consistent date formats).
  3. Filtered captions with missing or invalid funniness scores.
  4. Tokenized and cleaned the text (e.g., removed punctuation and stopwords).
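
The four steps above can be sketched roughly as follows. This is a minimal, standard-library illustration rather than the pipeline actually used: the column names (`contest`, `date`, `caption`, `funniness`), the accepted date layouts, and the tiny stopword list are all assumptions.

```python
import string
from datetime import datetime

# Hypothetical column names; the real NYCC schema may differ.
KEEP_COLUMNS = ("contest", "date", "caption", "funniness")

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it", "i"}


def parse_date(raw):
    """Normalize a few assumed date layouts to ISO format (YYYY-MM-DD)."""
    for layout in ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y"):
        try:
            return datetime.strptime(raw.strip(), layout).date().isoformat()
        except ValueError:
            continue
    return None


def parse_score(raw):
    """Return the score as a float, or None if it is missing or not numeric."""
    try:
        score = float(raw)
    except (TypeError, ValueError):
        return None
    # NaN is the only float value that is not equal to itself.
    return score if score == score else None


def tokenize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]


def preprocess(rows):
    cleaned = []
    for row in rows:
        # Step 1: keep only the columns of interest.
        row = {key: row.get(key) for key in KEEP_COLUMNS}
        # Step 3: skip rows with a missing or invalid funniness score.
        score = parse_score(row["funniness"])
        if score is None or not row["caption"]:
            continue
        # Step 2: standardize text case and date format.
        caption = row["caption"].strip().lower()
        cleaned.append({
            "contest": int(row["contest"]),
            "date": parse_date(row["date"] or ""),
            "caption": caption,
            "funniness": score,
            # Step 4: tokenized, punctuation-free, stopword-free text.
            "tokens": tokenize(caption),
        })
    return cleaned
```

The filter runs before the more expensive text cleaning so that discarded rows are never tokenized.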

For more references, see [2] and [3].


Oxford Humor in Context (OHIC)

The Oxford Humor in Context dataset contains humorous image-text pairs with linguistic and social annotations.

Why OHIC?

The OHIC dataset provides a broader, general-audience perspective on humor. It includes a diverse range of humor types and contexts, making it ideal for:

  • Analyzing humor across different cultural and social contexts.
  • Exploring the relationship between image and text in humor.
  • Studying humor in a less formal, more organic setting.

Key Insights

OHIC, distributed under the name OxfordTVG-HIC, offers approximately 2.9M image-text pairs with humor scores, curated to avoid offensive content. It is particularly useful for:

  • Training humor captioning models with diverse emotional and semantic content.
  • Evaluating humor generation and understanding in deep learning models.
  • Exploring visual and linguistic cues aligned with the benign violation theory of humor.

For more details, see [5] and [6] in the References section of the Sources page.

Preprocessing Steps

  1. Loaded the dataset and filtered out non-humorous entries.
  2. Aligned the schema with the NYCC dataset.
  3. Cleaned and tokenized the text data.
  4. Extracted linguistic features (e.g., sentiment, topics).
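
A minimal sketch of how these steps might look in code is shown below. The field names in `FIELD_MAP`, the `MIN_HUMOR_SCORE` cutoff, and the toy sentiment lexicon are all hypothetical, chosen only to make the example runnable. Topic extraction is omitted.

```python
import string

# Hypothetical mapping from OHIC field names to an NYCC-style schema.
FIELD_MAP = {"image_id": "contest", "text": "caption", "humor_score": "funniness"}

# Assumed cutoff below which an entry is treated as non-humorous.
MIN_HUMOR_SCORE = 0.5

# Toy sentiment lexicon, for illustration only.
POSITIVE = {"love", "great", "happy", "fun"}
NEGATIVE = {"hate", "bad", "sad", "awful"}


def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def sentiment(tokens):
    """Lexicon score in [-1, 1]: (positive - negative) / matched words."""
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)


def align_and_featurize(entries):
    result = []
    for entry in entries:
        # Step 2: rename fields so they match the NYCC-style schema.
        row = {new: entry.get(old) for old, new in FIELD_MAP.items()}
        # Step 1: drop entries that fall below the assumed humor cutoff.
        if row["funniness"] is None or row["funniness"] < MIN_HUMOR_SCORE:
            continue
        # Step 3: clean and tokenize the text.
        tokens = tokenize(row["caption"] or "")
        # Step 4: attach a simple linguistic feature.
        row["tokens"] = tokens
        row["sentiment"] = sentiment(tokens)
        result.append(row)
    return result
```

Sharing one schema across both datasets means that later analysis code can run on either without special cases.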

Explore More

Dive deeper into our models, methods, and findings: