1. Data Integration and Preprocessing

Goal: Create harmonized, comparable datasets for linguistic and semantic analysis.

We worked with two datasets: the New Yorker Caption Contest (NYCC) and the Oxford Humor in Context (OHIC). Both pair images with candidate captions that human raters have scored by how funny they are. While the two datasets share a similar purpose, they differ in structure and content, so preprocessing was needed to align them for analysis.

Data Loading

To streamline the data loading process, we implemented two helper classes:

  • NewYorkerDataset: Loads the NYCC dataset and extracts relevant columns such as image_id, rank, caption, mean, precision, votes, and vote categories (not_funny, somewhat_funny, funny).
  • OxfordDataset: Loads the OHIC dataset and extracts columns such as image_id, rank, caption, and funny_score.
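A minimal sketch of such a loader, assuming CSV input with the column names listed above (the file layout and class interface are illustrative, not the project's exact implementation):

```python
import csv
from pathlib import Path

class NewYorkerDataset:
    """Loads NYCC caption records from a CSV file (column names assumed)."""
    COLUMNS = ["image_id", "rank", "caption", "mean", "precision",
               "votes", "not_funny", "somewhat_funny", "funny"]

    def __init__(self, path):
        self.path = Path(path)

    def load(self):
        # Keep only the columns relevant to the analysis.
        with self.path.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            return [{k: row[k] for k in self.COLUMNS if k in row}
                    for row in reader]
```

The OxfordDataset helper would follow the same pattern with its smaller column set (image_id, rank, caption, funny_score).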

Preprocessing Steps

After loading the datasets, we applied the following preprocessing steps:

  • Text Cleaning: Lowercased text, removed punctuation, and filtered stopwords.
  • Tokenization: Prepared captions for downstream analysis.
  • Feature Engineering: Extracted word and character counts, and performed sentiment analysis.
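The cleaning and feature-extraction steps above can be sketched as follows; the stopword set here is an illustrative subset, not the full list used in the project, and sentiment analysis is omitted for brevity:

```python
import string

# Illustrative subset; a full stopword list (e.g. NLTK's) would be used in practice.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def clean_caption(text):
    """Lowercase, strip punctuation, and drop stopwords; returns tokens."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.split() if t not in STOPWORDS]

def caption_features(text):
    """Basic engineered features: word and character counts."""
    return {
        "tokens": clean_caption(text),
        "word_count": len(text.split()),
        "char_count": len(text),
    }
```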

Dataset Validation

We validated the NYCC dataset by checking for missing caption files and images. This step ensured that the dataset was complete and ready for analysis. We identified a few missing files, which were documented for transparency.
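The completeness check can be sketched like this, assuming images are stored as one file per image_id (the directory layout and file extension are assumptions):

```python
from pathlib import Path

def find_missing_images(image_ids, image_dir, ext=".jpg"):
    """Return the image_ids with no corresponding file in image_dir."""
    image_dir = Path(image_dir)
    return [i for i in image_ids if not (image_dir / f"{i}{ext}").exists()]
```

Any ids returned by this check would be logged and documented rather than silently dropped.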


2. Thematic NLP and Semantic Modeling

Goal: Classify captions into predefined humor types and recurring topic themes to better understand the datasets.

We used advanced NLP techniques to analyze captions from the datasets. Two main tasks were performed:

Humor Type Classification

We categorized captions into six predefined humor types: Affiliative, Sexual, Offensive, Irony/Satire, Absurdist, and Dark. These categories were chosen to ensure clear distinctions and reduce ambiguity.

We used the following methods for humor type classification:

  • Zero-Shot Classification: Used facebook/bart-large-mnli to assign humor types, treating each candidate label as a hypothesis and scoring it against the caption via natural-language-inference entailment.
  • Sentence Embeddings: Used sentence-transformers/all-MiniLM-L6-v2 to generate embeddings and calculate similarity scores for humor type assignment.

These methods enabled robust and consistent humor type assignments, even for captions with ambiguous or overlapping styles.
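With embeddings in hand (e.g. from sentence-transformers/all-MiniLM-L6-v2), the embedding-based assignment step reduces to cosine similarity plus argmax. This sketch assumes caption and humor-type embeddings are already computed; the function name is hypothetical:

```python
import numpy as np

HUMOR_TYPES = ["Affiliative", "Sexual", "Offensive",
               "Irony/Satire", "Absurdist", "Dark"]

def assign_humor_type(caption_emb, type_embs):
    """Pick the humor type whose embedding is most cosine-similar to the caption's.

    caption_emb: 1-D array; type_embs: 2-D array, one row per humor type.
    """
    c = caption_emb / np.linalg.norm(caption_emb)
    t = type_embs / np.linalg.norm(type_embs, axis=1, keepdims=True)
    sims = t @ c  # cosine similarities, one per humor type
    best = int(np.argmax(sims))
    return HUMOR_TYPES[best], float(sims[best])
```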

Topic Classification

To better understand the thematic content of captions, we classified them into predefined topics (e.g., Love, Family, School, Work, Politics and Social Issues, Nature, Food, Emotions, Entertainment and Pop Culture, Caption Contest, Health, Law, Other).

We used the following methodology for topic classification:

  • Generated embeddings for captions and topic descriptions using sentence-transformers/all-MiniLM-L6-v2.
  • Calculated cosine similarity between caption embeddings and topic embeddings to assign topics.
  • Filtered noisy matches by retaining, for each topic, only the captions in the top 25% of similarity scores, ensuring high-quality classifications.
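The top-25% filter can be sketched as a per-topic percentile cutoff (the function name and input shape are assumptions):

```python
import numpy as np

def filter_top_quartile(scores_by_topic):
    """Keep only the caption indices whose similarity score falls in the
    top 25% for that topic (at or above the 75th percentile)."""
    kept = {}
    for topic, scores in scores_by_topic.items():
        scores = np.asarray(scores, dtype=float)
        cutoff = np.percentile(scores, 75)
        kept[topic] = [i for i, s in enumerate(scores) if s >= cutoff]
    return kept
```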

This approach allowed us to identify the most recurrent themes in the datasets and analyze their relationships with funniness scores and ranks.

3. Stylistic Analysis

Goal: Analyze linguistic characteristics to highlight differences in tone, structure, and word usage between NYCC and OHIC captions.

  • Data Preprocessing: Filtered captions to remove excessively long entries and cleaned text by removing unnecessary characters.
  • Word Length Analysis: Computed caption lengths in characters and words, and visualized distributions using histograms and boxplots.
  • Part-of-Speech Analysis: Tagged captions with part-of-speech categories and compared their densities across datasets.
  • Formality Score Analysis: Calculated formality scores using the F-score metric and compared them across datasets using visualizations.
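One common formality F-score is Heylighen and Dewaele's measure over part-of-speech frequencies; assuming that is the metric referred to above, it can be computed from POS-tag counts like this (the input format is an assumption):

```python
def formality_score(pos_counts):
    """Heylighen & Dewaele F-score: F = (noun% + adjective% + preposition%
    + article% - pronoun% - verb% - adverb% - interjection% + 100) / 2.
    Higher values indicate more formal, noun-heavy text."""
    total = sum(pos_counts.values())
    if total == 0:
        return None
    pct = {k: 100.0 * v / total for k, v in pos_counts.items()}
    formal = sum(pct.get(k, 0.0)
                 for k in ("noun", "adjective", "preposition", "article"))
    deictic = sum(pct.get(k, 0.0)
                  for k in ("pronoun", "verb", "adverb", "interjection"))
    return (formal - deictic + 100.0) / 2.0
```

Scores near 100 indicate maximally formal text, near 0 maximally deictic; a balanced mix lands around 50.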

This methodology provided a detailed framework for analyzing stylistic differences between NYCC and OHIC captions.

