Texting and Conversation Summarization Program Using ML
April 21, 2024
Overview
This project implements a Natural Language Processing (NLP) system that summarizes long histories of text message conversations. The system operates through four main stages:
- Clustering messages
- Extracting topics
- Classifying messages by topic
- Generating summaries for each topic
The tool is built using unsupervised learning techniques and was tested using WhatsApp group chats from FIU student and club discussions.
Implementation Steps
1. Preprocessing
- WhatsApp chat logs are exported as
.txt
files. - Messages are cleaned using regex to remove emojis and non-standard characters.
- Data is converted into CSV with timestamps, message content, and sender info.
2. Clustering Messages
- Messages are grouped using Agglomerative Clustering based on time proximity (5-second window).
- Messages within a short time frame are treated as part of the same conversation and concatenated.
3. Topic Modeling with BERTopic
- BERTopic is used to generate topic clusters and assign confidence scores.
- Stopwords are removed after embedding generation to preserve contextual understanding.
- Topic information is merged back into the original message DataFrame.
4. Classifying Outliers with K-NN
- BERTopic may label some messages as outliers.
- These are classified using a K-Nearest Neighbors classifier based on message timestamp.
5. Summarization with GPT-3.5 Turbo
- Representative messages per topic are passed to GPT-3.5 with two prompts:
- Summarization Prompt: “Briefly explain what the text is about in 5 sentences.”
- Refined Title Prompt: “Reply with a short title that summarizes the content.”
- These outputs create readable summaries and human-friendly topic titles.
Results
Topic Quality
- A test on 2400 messages from an FIU Capstone 2 group chat yielded 26 diverse and coherent topics.
- Topic diversity score: 0.9, indicating minimal redundancy.
- Examples:
nextcloud_app_project_integration
– discussions around the use of the Nextcloud platform.cs_member_group_looking
– students searching for project teammates.- Less meaningful clusters like
yes_yep_yeah_yup
also emerged due to repetitive affirmation messages.
Summary Quality
- GPT-3.5 summaries matched the message content accurately.
- Example for
nextcloud_app_project_integration
:- Refined Title: "Challenges and Frustrations with Nextcloud Integration"
- Summary: Captures confusion, setup issues, collaboration efforts, and emotional responses to working with the Nextcloud platform.
Conclusion
The project successfully created an automated tool to summarize large-scale text conversations using:
- BERTopic for topic modeling,
- K-NN for outlier classification,
- GPT-3.5 for high-quality summarization.
This tool allows users to glean insight from long message histories without reading thousands of messages manually.