1. Data Integration and Preprocessing

Goal: Create harmonized, comparable datasets for linguistic and semantic analysis.

We worked with two datasets: the New Yorker Caption Contest (NYCC) and the Oxford Humor in Context (OHIC). Both pair images with candidate captions that human raters have scored by how funny they are. While the two datasets share a similar purpose, they differ in structure and content, so preprocessing was needed to align them for analysis.

Data Loading

To streamline the data loading process, we implemented two helper classes:

  • NewYorkerDataset: Loads the NYCC dataset and extracts relevant columns such as image_id, rank, caption, mean, precision, votes, and vote categories (not_funny, somewhat_funny, funny).
  • OxfordDataset: Loads the OHIC dataset and extracts columns such as image_id, rank, caption, and funny_score.
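A minimal sketch of such a loader, assuming CSV input with the column names listed above (the file layout and class interface are illustrative, not the project's exact implementation):

```python
import csv
from pathlib import Path

class NewYorkerDataset:
    """Loads NYCC caption records from a CSV file (column names assumed)."""
    COLUMNS = ["image_id", "rank", "caption", "mean", "precision",
               "votes", "not_funny", "somewhat_funny", "funny"]

    def __init__(self, path):
        self.path = Path(path)

    def load(self):
        # Keep only the columns relevant to the analysis.
        with self.path.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            return [{k: row[k] for k in self.COLUMNS if k in row}
                    for row in reader]
```

The OxfordDataset helper would follow the same pattern with its smaller column set (image_id, rank, caption, funny_score).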

Preprocessing Steps

After loading the datasets, we applied the following preprocessing steps:

  • Text Cleaning: Lowercased text, removed punctuation, and filtered stopwords.
  • Tokenization: Prepared captions for downstream analysis.
  • Feature Engineering: Extracted word and character counts, and performed sentiment analysis.
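The cleaning and feature-extraction steps above can be sketched as follows; the stopword set here is an illustrative subset, not the full list used in the project, and sentiment analysis is omitted for brevity:

```python
import string

# Illustrative subset; a full stopword list (e.g. NLTK's) would be used in practice.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def clean_caption(text):
    """Lowercase, strip punctuation, and drop stopwords; returns tokens."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.split() if t not in STOPWORDS]

def caption_features(text):
    """Basic engineered features: word and character counts."""
    return {
        "tokens": clean_caption(text),
        "word_count": len(text.split()),
        "char_count": len(text),
    }
```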

Dataset Validation

We validated the NYCC dataset by checking for missing caption files and images. This step ensured that the dataset was complete and ready for analysis. We identified a few missing files, which were documented for transparency.
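The completeness check can be sketched like this, assuming images are stored as one file per image_id (the directory layout and file extension are assumptions):

```python
from pathlib import Path

def find_missing_images(image_ids, image_dir, ext=".jpg"):
    """Return the image_ids with no corresponding file in image_dir."""
    image_dir = Path(image_dir)
    return [i for i in image_ids if not (image_dir / f"{i}{ext}").exists()]
```

Any ids returned by this check would be logged and documented rather than silently dropped.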


2. Thematic NLP and Semantic Modeling

Goal: Classify captions into predefined humor types and recurring topic themes to better understand the datasets.

We used advanced NLP techniques to analyze captions from the datasets. Two main tasks were performed:

Humor Type Classification

We categorized captions into six predefined humor types: Affiliative, Sexual, Offensive, Irony/Satire, Absurdist, and Dark. These categories were chosen to ensure clear distinctions and reduce ambiguity.

We used the following methods for humor type classification:

  • Zero-Shot Classification: Used facebook/bart-large-mnli to assign humor types, treating each candidate label as a hypothesis and scoring it against the caption via natural-language-inference entailment.
  • Sentence Embeddings: Used sentence-transformers/all-MiniLM-L6-v2 to generate embeddings and calculate similarity scores for humor type assignment.

These methods enabled robust and consistent humor type assignments, even for captions with ambiguous or overlapping styles.
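With embeddings in hand (e.g. from sentence-transformers/all-MiniLM-L6-v2), the embedding-based assignment step reduces to cosine similarity plus argmax. This sketch assumes caption and humor-type embeddings are already computed; the function name is hypothetical:

```python
import numpy as np

HUMOR_TYPES = ["Affiliative", "Sexual", "Offensive",
               "Irony/Satire", "Absurdist", "Dark"]

def assign_humor_type(caption_emb, type_embs):
    """Pick the humor type whose embedding is most cosine-similar to the caption's.

    caption_emb: 1-D array; type_embs: 2-D array, one row per humor type.
    """
    c = caption_emb / np.linalg.norm(caption_emb)
    t = type_embs / np.linalg.norm(type_embs, axis=1, keepdims=True)
    sims = t @ c  # cosine similarities, one per humor type
    best = int(np.argmax(sims))
    return HUMOR_TYPES[best], float(sims[best])
```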

Topic Classification

To better understand the thematic content of captions, we classified them into predefined topics (e.g., Love, Family, School, Work, Politics and Social Issues, Nature, Food, Emotions, Entertainment and Pop Culture, Caption Contest, Health, Law, Other).

We used the following methodology for topic classification:

  • Generated embeddings for captions and topic descriptions using sentence-transformers/all-MiniLM-L6-v2.
  • Calculated cosine similarity between caption embeddings and topic embeddings to assign topics.
  • Filtered noisy matches by retaining, for each topic, only the captions in the top 25% of similarity scores, ensuring high-quality classifications.
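The top-25% filter can be sketched as a per-topic percentile cutoff (the function name and input shape are assumptions):

```python
import numpy as np

def filter_top_quartile(scores_by_topic):
    """Keep only the caption indices whose similarity score falls in the
    top 25% for that topic (at or above the 75th percentile)."""
    kept = {}
    for topic, scores in scores_by_topic.items():
        scores = np.asarray(scores, dtype=float)
        cutoff = np.percentile(scores, 75)
        kept[topic] = [i for i, s in enumerate(scores) if s >= cutoff]
    return kept
```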

This approach allowed us to identify the most recurrent themes in the datasets and analyze their relationships with funniness scores and ranks.

3. Stylistic Analysis

Goal: Analyze linguistic characteristics to highlight differences in tone, structure, and word usage between NYCC and OHIC captions.

  • Data Preprocessing: Filtered captions to remove excessively long entries and cleaned text by removing unnecessary characters.
  • Word Length Analysis: Computed caption lengths in characters and words, and visualized distributions using histograms and boxplots.
  • Part-of-Speech Analysis: Tagged captions with part-of-speech categories and compared their densities across datasets.
  • Formality Score Analysis: Calculated formality scores using the F-score metric and compared them across datasets using visualizations.
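One common formality F-score is Heylighen and Dewaele's measure over part-of-speech frequencies; assuming that is the metric referred to above, it can be computed from POS-tag counts like this (the input format is an assumption):

```python
def formality_score(pos_counts):
    """Heylighen & Dewaele F-score: F = (noun% + adjective% + preposition%
    + article% - pronoun% - verb% - adverb% - interjection% + 100) / 2.
    Higher values indicate more formal, noun-heavy text."""
    total = sum(pos_counts.values())
    if total == 0:
        return None
    pct = {k: 100.0 * v / total for k, v in pos_counts.items()}
    formal = sum(pct.get(k, 0.0)
                 for k in ("noun", "adjective", "preposition", "article"))
    deictic = sum(pct.get(k, 0.0)
                  for k in ("pronoun", "verb", "adverb", "interjection"))
    return (formal - deictic + 100.0) / 2.0
```

Scores near 100 indicate maximally formal text, near 0 maximally deictic; a balanced mix lands around 50.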

This methodology provided a detailed framework for analyzing stylistic differences between NYCC and OHIC captions.

