Sentiment Attribution & Methodology

Last updated: December 2025

Transparency commitment: Every sentiment score on PublicMoodTracker is traceable to its source articles and model outputs. This document provides full methodological transparency so users can critically evaluate and contextualise the data they see.

PublicMoodTracker generates political sentiment scores using a reproducible, version-controlled AI pipeline. This document explains how scores are calculated, what they represent, and their known limitations, in line with emerging AI transparency obligations and a good-faith commitment to responsible AI deployment in Kenya's democratic context.

1. What a Sentiment Score Represents

A PublicMoodTracker sentiment score for a politician represents the weighted average sentiment of public media discourse about that individual over a defined time window (24 hours, 7 days, or 30 days). It is derived from news articles, social media posts, and broadcast content — not from direct polling, surveys, or voter intention data.

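Scores range from −1.0 (uniformly negative coverage) to +1.0 (uniformly positive coverage). A score of +0.72 means that, on average, public media coverage of this politician is moderately positive. It does not mean that 72% of Kenyans support them.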

2. Data Ingestion Pipeline

2.1 Sources (9 active scrapers)

Source | Type | Language | Update Frequency
Nation Media Group | Online news | English / Swahili | Every 15 min
Standard Group Digital | Online news | English | Every 15 min
Citizen Digital (RMS) | Online news | English / Swahili | Every 15 min
Tuko.co.ke | Online tabloid | English / Swahili | Every 15 min
The Star Kenya | Online news | English | Every 15 min
Twitter / X (Kenyan political) | Social media | English / Swahili / Sheng | Real-time stream
YouTube (KE political channels) | Video comments | English / Swahili / Sheng | Every 30 min
The Elephant (blog) | Long-form analysis | English | Daily
Parliamentary Hansard | Official record | English | Weekly (session days)

2.2 Content Preprocessing

  1. De-duplication: Near-duplicate articles (cosine similarity > 0.92) are collapsed to prevent single stories dominating scores.
  2. Spam filtering: Content flagged by our spam classifier (precision: 94%) is excluded.
  3. Sentence segmentation: Articles are split into sentences; sentiment is scored per sentence, not per article.
  4. Language detection: The langdetect library classifies the language of each text; Swahili and Sheng content is routed to the multilingual model path.
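
The de-duplication step can be illustrated with a minimal sketch. It assumes simple bag-of-words vectors; the document does not say which text representation the production pipeline uses, so `cosine_similarity` and `collapse_near_duplicates` are hypothetical helpers, and only the 0.92 threshold comes from the text above.

```python
import math
from collections import Counter

SIMILARITY_THRESHOLD = 0.92  # threshold stated in step 1 above


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def collapse_near_duplicates(articles: list[str]) -> list[str]:
    """Keep the first article of each near-duplicate group, in input order."""
    kept: list[str] = []
    kept_vectors: list[Counter] = []
    for text in articles:
        # Assumption: lowercase whitespace tokens stand in for real features.
        vector = Counter(text.lower().split())
        if any(cosine_similarity(vector, kv) > SIMILARITY_THRESHOLD
               for kv in kept_vectors):
            continue  # near-duplicate of an article already kept
        kept.append(text)
        kept_vectors.append(vector)
    return kept
```

Comparing every article against every kept article is quadratic, which is acceptable for a sketch but would likely need an index or hashing scheme at production volume.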

3. Named Entity Recognition (NER)

Our custom NER model identifies Kenyan political entities within text: politicians, counties, parties, and institutions. It is fine-tuned from bert-base-multilingual-cased on a labelled Kenyan political corpus of 28,000 sentences.

  • Mention linking: Aliases are resolved (e.g., "DP Ruto", "His Excellency Ruto", "Hustler" → William Ruto).
  • Ambiguity handling: Ambiguous names (e.g., common surnames) require county or party context; unresolvable ambiguities are discarded.
  • Minimum threshold: A politician requires ≥3 mentions within a scoring window to appear in rankings.
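
As a rough illustration of mention linking and the minimum threshold, the sketch below resolves aliases through a lookup table and keeps only politicians with at least three resolved mentions. The table and the function names are hypothetical; only the three aliases and the threshold of 3 come from the list above, and context-based disambiguation of ambiguous names is omitted.

```python
from collections import Counter
from typing import Optional

# Hypothetical alias table; the three aliases are the examples quoted above.
ALIASES = {
    "dp ruto": "William Ruto",
    "his excellency ruto": "William Ruto",
    "hustler": "William Ruto",
}
MIN_MENTIONS = 3  # minimum mentions per scoring window


def resolve_mention(mention: str) -> Optional[str]:
    """Map a raw mention to a canonical name, or None if it is unknown."""
    return ALIASES.get(mention.strip().lower())


def rankable_politicians(mentions: list[str]) -> set[str]:
    """Return politicians with enough resolved mentions to be ranked."""
    counts: Counter = Counter()
    for mention in mentions:
        name = resolve_mention(mention)
        if name is not None:  # unresolved mentions are discarded
            counts[name] += 1
    return {name for name, n in counts.items() if n >= MIN_MENTIONS}
```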

4. Sentiment Classification Model

4.1 Base Model

XLM-RoBERTa-large (Phase 13.1) — a transformer model supporting 100 languages, fine-tuned on 12,000 labelled Kenyan political sentences (English, Swahili, Sheng). Labels: POSITIVE, NEGATIVE, NEUTRAL.

4.2 Custom Lexicon Enhancement

A supplemental lexicon of 40 Swahili political terms and 150 Sheng political terms (e.g., "hustler", "dynasty", "kamwana") modifies token embeddings before classification to improve accuracy for Kenyan-specific discourse.

4.3 Output

Each classified mention produces:

  • sentiment_label: POSITIVE / NEGATIVE / NEUTRAL
  • confidence: 0.0–1.0 (model softmax probability)
  • raw_score: +1.0 (positive), −1.0 (negative), 0.0 (neutral)
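
One way to picture this output record is as a small data class. The field names follow the list above; the class name `ClassifiedMention` and the validation are assumptions made for illustration, not the pipeline's actual schema.

```python
from dataclasses import dataclass

# Mapping from label to raw_score, as listed above.
RAW_SCORES = {"POSITIVE": 1.0, "NEGATIVE": -1.0, "NEUTRAL": 0.0}


@dataclass(frozen=True)
class ClassifiedMention:
    """Hypothetical container for one classified mention."""
    sentiment_label: str
    confidence: float

    def __post_init__(self) -> None:
        if self.sentiment_label not in RAW_SCORES:
            raise ValueError(f"unknown label: {self.sentiment_label}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")

    @property
    def raw_score(self) -> float:
        """raw_score is fully determined by the label."""
        return RAW_SCORES[self.sentiment_label]
```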

5. Score Aggregation Formula

5.1 Weighted Average Score

The displayed score for a politician over a time window is a confidence-weighted mean:

score = Σ(raw_score_i × confidence_i) / Σ(confidence_i)

Mentions with confidence < 0.55 are excluded to reduce noise from ambiguous text.
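
The formula and the exclusion rule can be sketched in a few lines. The function name `weighted_score` and the tuple input format are assumptions for illustration; the 0.55 cut-off and the formula are taken from the text above.

```python
from typing import Optional

MIN_CONFIDENCE = 0.55  # mentions below this confidence are excluded


def weighted_score(mentions: list[tuple[float, float]]) -> Optional[float]:
    """Confidence-weighted mean of (raw_score, confidence) pairs.

    Returns None when no mention passes the confidence cut-off.
    """
    kept = [(raw, conf) for raw, conf in mentions if conf >= MIN_CONFIDENCE]
    total_confidence = sum(conf for _, conf in kept)
    if total_confidence == 0:
        return None
    return sum(raw * conf for raw, conf in kept) / total_confidence


# Illustrative values: the last mention (confidence 0.5) is dropped.
example = [(1.0, 0.9), (-1.0, 0.6), (0.0, 0.7), (1.0, 0.5)]
score = weighted_score(example)
```

For these illustrative values the numerator is 0.9 − 0.6 + 0 = 0.3 and the denominator is 0.9 + 0.6 + 0.7 = 2.2, giving a score of roughly +0.14.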

5.2 Polarity Confidence

Measures how consistently directional public discourse is (used in the Rivals comparison table):

polarity_confidence = |positive_count − negative_count| / total_mentions

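A value near 1.0 means near-unanimous sentiment in one direction. A value near 0.0 means positive and negative mentions largely cancel out (deeply polarised discourse); because the denominator is total_mentions, a window dominated by neutral mentions also scores low.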
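
A minimal sketch of this calculation follows, assuming each mention is represented by its label string and that total_mentions counts every mention in the window, neutral ones included. The function name is hypothetical.

```python
from typing import Optional


def polarity_confidence(labels: list[str]) -> Optional[float]:
    """|positive_count - negative_count| / total_mentions."""
    total = len(labels)  # assumption: total includes NEUTRAL mentions
    if total == 0:
        return None
    positive = labels.count("POSITIVE")
    negative = labels.count("NEGATIVE")
    return abs(positive - negative) / total


# Illustrative windows: one lopsided, one evenly split.
lopsided = ["POSITIVE"] * 8 + ["NEGATIVE"] + ["NEUTRAL"]
split = ["POSITIVE"] * 5 + ["NEGATIVE"] * 5
```

Here `polarity_confidence(lopsided)` is |8 − 1| / 10 = 0.7, while `polarity_confidence(split)` is 0.0.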

5.3 Mention Velocity (Spike Detection)

velocity = mentions_in_last_hour / avg_mentions_per_hour_last_7d
spike_flag = True if velocity ≥ 3.0 and |score_delta_1h| ≥ 0.4
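
The two conditions can be sketched as follows. The function names are assumptions, as is returning None when there is no 7-day baseline; the thresholds 3.0 and 0.4 come from the rule above.

```python
from typing import Optional

VELOCITY_THRESHOLD = 3.0     # hourly mentions at 3x the 7-day hourly average
SCORE_DELTA_THRESHOLD = 0.4  # absolute score change over the last hour


def mention_velocity(mentions_last_hour: int,
                     avg_mentions_per_hour_7d: float) -> Optional[float]:
    """Ratio of the last hour's mentions to the 7-day hourly average."""
    if avg_mentions_per_hour_7d <= 0:
        return None  # assumption: no baseline means velocity is undefined
    return mentions_last_hour / avg_mentions_per_hour_7d


def is_spike(velocity: Optional[float], score_delta_1h: float) -> bool:
    """Both conditions must hold for a spike to be flagged."""
    return (velocity is not None
            and velocity >= VELOCITY_THRESHOLD
            and abs(score_delta_1h) >= SCORE_DELTA_THRESHOLD)
```

Requiring both conditions means a surge in volume alone does not raise the flag unless the score also moves sharply in either direction.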

6. Score Display Rules

Condition | Display Behaviour
< 3 mentions in window | Score hidden; shown as "Insufficient data"
3–9 mentions | Score shown with ⚠ low-confidence warning
10–49 mentions | Score shown with confidence label
≥ 50 mentions | Full score with high-confidence badge
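
These display rules amount to a simple threshold lookup, sketched here. The function name and the returned tier strings are hypothetical labels chosen for illustration; the mention-count boundaries are those in the table above.

```python
def display_tier(mention_count: int) -> str:
    """Map a window's mention count to a display tier."""
    if mention_count < 3:
        return "insufficient_data"      # score hidden
    if mention_count <= 9:
        return "low_confidence_warning"
    if mention_count <= 49:
        return "confidence_label"
    return "high_confidence_badge"      # 50 or more mentions
```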

7. Known Limitations

  • Media ≠ public opinion: Scores reflect what journalists and social users write, not what the broader public thinks.
  • Sheng coverage gaps: Rapidly evolving Sheng slang may lag our lexicon updates by 2–4 weeks.
  • Satire misclassification: The model may misread satirical content as genuine sentiment; estimated error rate ~3%.
  • Source bias: Not all Kenyan media outlets are included; sources skew toward urban, online audiences.
  • Coordinated inauthentic behaviour: Social media manipulation (bot networks) can temporarily distort scores; we apply velocity filters but cannot guarantee full detection.

8. Model Versioning and Change Log

Version | Date | Key Changes
Phase 13.1 | Nov 2025 | Sheng lexicon expanded to 150 terms; Hansard source added
Phase 12.0 | Jun 2025 | XLM-RoBERTa-large (replaced base); confidence threshold raised to 0.55
Phase 11.3 | Jan 2025 | County-level NER improved; duplicate detection threshold tuned

9. Requesting Source Evidence

Any registered user may click the transparency icon (🔍) on a politician's profile to view the source articles contributing to the current score. Users with a Daily Access subscription can export these citations as a reference list.

Academic researchers may request full methodology documentation by contacting research@siasaiq.com.