Texting and Conversation Summarization Program Using ML

April 21, 2024

Overview

This project implements a Natural Language Processing (NLP) system that summarizes long histories of text message conversations. The system operates through four main stages:

  1. Clustering messages
  2. Extracting topics
  3. Classifying messages by topic
  4. Generating summaries for each topic

The tool is built using unsupervised learning techniques and was tested using WhatsApp group chats from FIU student and club discussions.

Implementation Steps

1. Preprocessing

  • WhatsApp chat logs are exported as .txt files.
  • Messages are cleaned using regex to remove emojis and non-standard characters.
  • Data is converted into CSV with timestamps, message content, and sender info.

2. Clustering Messages

  • Messages are grouped using Agglomerative Clustering based on time proximity (5-second window).
  • Messages within a short time frame are treated as part of the same conversation and concatenated.

3. Topic Modeling with BERTopic

  • BERTopic is used to generate topic clusters and assign confidence scores.
  • Stopwords are removed after embedding generation to preserve contextual understanding.
  • Topic information is merged back into the original message DataFrame.

4. Classifying Outliers with K-NN

  • BERTopic may label some messages as outliers.
  • These are classified using a K-Nearest Neighbors classifier based on message timestamp.

5. Summarization with GPT-3.5 Turbo

  • Representative messages per topic are passed to GPT-3.5 with two prompts:
    • Summarization Prompt: “Briefly explain what the text is about in 5 sentences.”
    • Refined Title Prompt: “Reply with a short title that summarizes the content.”
  • These outputs create readable summaries and human-friendly topic titles.

Results

Topic Quality

  • A test on 2400 messages from an FIU Capstone 2 group chat yielded 26 diverse and coherent topics.
  • Topic diversity score: 0.9, indicating minimal redundancy.
  • Examples:
    • nextcloud_app_project_integration – discussions around the use of the Nextcloud platform.
    • cs_member_group_looking – students searching for project teammates.
    • Less meaningful clusters like yes_yep_yeah_yup also emerged due to repetitive affirmation messages.

Summary Quality

  • GPT-3.5 summaries matched the message content accurately.
  • Example for nextcloud_app_project_integration:
    • Refined Title: "Challenges and Frustrations with Nextcloud Integration"
    • Summary: Captures confusion, setup issues, collaboration efforts, and emotional responses to working with the Nextcloud platform.

Conclusion

The project successfully created an automated tool to summarize large-scale text conversations using:

  • BERTopic for topic modeling,
  • K-NN for outlier classification,
  • GPT-3.5 for high-quality summarization.

This tool allows users to glean insight from long message histories without reading thousands of messages manually.