Models & Methods
Discover the methodologies and models used to analyze humor in our datasets.
Goal: Create harmonized, comparable datasets for linguistic and semantic analysis.
We worked with two datasets: the New Yorker Caption Contest (NYCC) and Oxford Humor in Context (OHIC). Both pair images with captions that human voters have scored for funniness. Although the datasets serve a similar purpose, they differ in structure and content, so we preprocessed them to make them comparable.
To streamline the data loading process, we implemented two helper classes:
image_id, rank, caption, mean, precision, votes, and vote categories (not_funny, somewhat_funny, funny).
image_id, rank, caption, and funny_score.
After loading the datasets, we applied the following preprocessing steps:
We validated the NYCC dataset by checking for missing caption files and images. This step ensured that the dataset was complete and ready for analysis. We identified a few missing files, which were documented for transparency.
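The completeness check described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the directory layout (captions/ and images/ subfolders), the file extensions, and the function name find_missing_files are all assumptions made for the example.

```python
from pathlib import Path


def find_missing_files(data_dir, contest_ids, caption_ext=".csv", image_ext=".jpg"):
    """Return (missing_captions, missing_images) as lists of contest ids.

    Assumes a hypothetical layout: <data_dir>/captions/<id>.csv and
    <data_dir>/images/<id>.jpg. Adjust to the real dataset structure.
    """
    data_dir = Path(data_dir)
    missing_captions, missing_images = [], []
    for contest_id in contest_ids:
        caption_path = data_dir / "captions" / f"{contest_id}{caption_ext}"
        image_path = data_dir / "images" / f"{contest_id}{image_ext}"
        if not caption_path.is_file():
            missing_captions.append(contest_id)
        if not image_path.is_file():
            missing_images.append(contest_id)
    return missing_captions, missing_images
```

The two returned lists can be written to a log or report so that the gaps stay documented alongside the analysis.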
Goal: Classify captions into predefined humor types and recurring topic themes to better understand the datasets.
We used pretrained language models to analyze captions from both datasets. Two main tasks were performed:
We categorized captions into six predefined humor types: Affiliative, Sexual, Offensive, Irony/Satire, Absurdist, and Dark. These categories were chosen to ensure clear distinctions and reduce ambiguity.
We used the following methods for humor type classification:
facebook/bart-large-mnli to assign humor types based on cosine similarity with predefined examples.
sentence-transformers/all-MiniLM-L6-v2 to generate embeddings and calculate similarity scores for humor type assignment.
These methods enabled robust and consistent humor type assignments, even for captions with ambiguous or overlapping styles.
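To make the embedding-based assignment concrete, here is a minimal sketch of picking the humor type whose example embeddings are most similar to a caption embedding. It uses toy vectors so that it runs without downloading a model. In practice the vectors would presumably come from the encode method of a SentenceTransformer loaded with sentence-transformers/all-MiniLM-L6-v2. The function names and the choice of taking the maximum similarity over each type's examples are assumptions, not details confirmed by the project.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_label(caption_vec, prototypes):
    """Return (label, score) for the best-matching humor type.

    prototypes maps each label to a list of example embeddings.
    Assumption: a label's score is its best-matching example.
    """
    best_label, best_score = None, -math.inf
    for label, example_vecs in prototypes.items():
        score = max(
            (cosine_similarity(caption_vec, ex) for ex in example_vecs),
            default=-math.inf,
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Toy 2-dimensional "embeddings" standing in for real model output.
prototypes = {"Absurdist": [[1.0, 0.0]], "Dark": [[0.0, 1.0]]}
label, score = assign_label([0.9, 0.1], prototypes)
```

Returning the score along with the label makes it possible to inspect low-confidence assignments, which is where ambiguous or overlapping styles are most likely to show up.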
To better understand the thematic content of captions, we classified them into predefined topics (e.g., Love, Family, School, Work, Politics and Social Issues, Nature, Food, Emotions, Entertainment and Pop Culture, Caption Contest, Health, Law, Other).
We used the following methodology for topic classification:
sentence-transformers/all-MiniLM-L6-v2.
This approach allowed us to identify the most recurrent themes in the datasets and analyze their relationships with funniness scores and ranks.
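Once each caption has a topic, relating topics to funniness scores can be as simple as a grouped summary. The sketch below assumes the input is a sequence of (topic, score) pairs. The function name and the choice to sort by frequency are illustrative assumptions, not the project's implementation.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_topic(rows):
    """Summarize (topic, score) pairs as {topic: (count, mean_score)}.

    Topics are ordered from most to least frequent.
    """
    scores = defaultdict(list)
    for topic, score in rows:
        scores[topic].append(score)
    summary = {topic: (len(vals), mean(vals)) for topic, vals in scores.items()}
    return dict(sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True))


# Made-up rows for illustration only.
example = mean_score_by_topic([("Work", 1.0), ("Work", 2.0), ("Food", 3.0)])
```

Keeping the count next to the mean helps avoid over-reading averages for topics with very few captions.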
Goal: Analyze linguistic characteristics to highlight differences in tone, structure, and word usage between NYCC and OHIC captions.
This methodology provided a detailed framework for analyzing stylistic differences between NYCC and OHIC captions.
Goal: Analyze how humor types and topics evolve over time and correlate with major sociopolitical and cultural events.
This methodology provided a comprehensive framework for understanding how external events and societal changes influence the thematic and stylistic focus of humor over time. By combining proportion analysis, z-score calculations, and heatmap visualizations, we captured both the macro-level trends and the nuanced shifts in humor styles and topics.
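The proportion and z-score steps mentioned above could look roughly like the sketch below. The source does not say how the z-scores were computed, so this version assumes each category's proportion is standardized against its own mean and population standard deviation across periods. The function names and the (period, category) input format are likewise hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def proportions_by_period(records):
    """Turn (period, category) records into {period: {category: proportion}}."""
    counts = {}
    for period, category in records:
        counts.setdefault(period, Counter())[category] += 1
    result = {}
    for period, counter in counts.items():
        total = sum(counter.values())
        result[period] = {cat: n / total for cat, n in counter.items()}
    return result


def zscores_for_category(props, category):
    """z-score of one category's proportion in each period.

    Assumption: standardize against the category's mean and population
    standard deviation across all periods; missing categories count as 0.
    """
    periods = sorted(props)
    values = [props[p].get(category, 0.0) for p in periods]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {p: 0.0 for p in periods}
    return {p: (v - mu) / sigma for p, v in zip(periods, values)}


# Made-up records for illustration only.
records = [
    (2016, "Politics"), (2016, "Politics"), (2016, "Food"), (2016, "Food"),
    (2017, "Politics"), (2017, "Food"), (2017, "Food"), (2017, "Food"),
]
props = proportions_by_period(records)
z = zscores_for_category(props, "Politics")
```

A period-by-category table of such z-scores is the kind of input a heatmap would visualize, with strongly positive cells marking periods where a topic or humor type is unusually prominent.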