In today’s experience-driven economy, understanding customer sentiment at scale is a strategic imperative—especially in service-heavy industries like hospitality. This guide details a robust, production-grade pipeline that leverages large language models (LLMs), Snowflake, and Databricks to transform raw survey feedback into structured, actionable insights. From data ingestion to visualization, the workflow demonstrates how modern AI infrastructure can replace manual effort, uncover hidden patterns, and drive measurable business impact.
1. Survey Data Source and Ingestion
The workflow begins with survey data collected by a third-party vendor. Surveys are distributed via email to a randomly sampled customer base, capturing feedback from recent or past experiences. The dataset comprises:
- Quantitative Fields: Checkbox-style ratings (e.g., 1–5 scales for satisfaction metrics).
- Qualitative Fields: Open-ended text comments providing detailed feedback.
- Metadata: Includes submission timestamps and associated service locations.
The vendor pushes raw survey data into Snowflake, a cloud-based enterprise data warehouse optimized for structured and semi-structured data. The primary focus is on text entries categorized into three sentiment levels: Unhappy, Neutral, and Happy. These are stored in a dedicated Snowflake table, with schema fields for comment text, sentiment category, and metadata.
Challenges:
- Data Volume: Surveys generate thousands of responses weekly, requiring efficient ingestion to avoid bottlenecks.
- Data Quality: Inconsistent formatting (e.g., HTML tags, emojis) and missing values necessitate robust pre-processing.
2. Data Extraction and Pre-processing in Databricks
Databricks, integrated with Snowflake via a secure connector, serves as the primary environment for data extraction and pre-processing. A Spark-based pipeline extracts relevant fields (comment text, sentiment, metadata) from Snowflake using optimized SQL queries.
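A minimal sketch of this extraction step is below, assuming a Databricks notebook (where spark and dbutils are predefined) and the Snowflake Spark connector; the connection options and the table and column names are illustrative placeholders, not the production schema.

```python
# Hypothetical Snowflake -> Databricks extraction. Credentials come from a
# Databricks secret scope; table/column names are illustrative only.
sf_options = {
    "sfUrl": "<account>.snowflakecomputing.com",
    "sfUser": dbutils.secrets.get("snowflake", "user"),
    "sfPassword": dbutils.secrets.get("snowflake", "password"),
    "sfDatabase": "SURVEYS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ANALYTICS_WH",
}

# Push the column selection and null filter down to Snowflake rather than
# pulling full rows into Spark.
raw_df = (
    spark.read.format("snowflake")
    .options(**sf_options)
    .option("query", """
        SELECT COMMENT_TEXT, SENTIMENT, LOCATION, SUBMITTED_AT
        FROM SURVEY_RESPONSES
        WHERE COMMENT_TEXT IS NOT NULL
    """)
    .load()
)
```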
Pre-processing Steps
- Filtering: Remove null comments and entries shorter than 10 characters to ensure meaningful input for the LLM.
- Text Cleaning:
  - Strip HTML tags using Python’s BeautifulSoup.
  - Remove emojis and non-UTF-8 characters with regular expressions (re.sub).
  - Normalize text by converting to lowercase and removing excessive whitespace.
- Text Enhancement:
  - Apply spell correction using pyspellchecker to handle common typos.
  - Perform lemmatization with spaCy to standardize word forms (e.g., “running” → “run”).
- Metadata Tagging: Attach location, timestamp, and sentiment labels to each record for downstream analysis.
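These steps can be composed into a single cleaning function. The sketch below is illustrative, assuming spaCy’s en_core_web_sm model and treating non-ASCII removal as a stand-in for emoji stripping; the thresholds mirror the list above but not an exact production implementation.

```python
import re

import spacy
from bs4 import BeautifulSoup
from spellchecker import SpellChecker

# Load once and reuse; parser/NER are disabled since only lemmas are needed.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
spell = SpellChecker()

def clean_comment(text: str) -> str | None:
    # Strip HTML tags, then drop emojis and other non-ASCII characters.
    text = BeautifulSoup(text, "html.parser").get_text()
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    # Lowercase and collapse excessive whitespace.
    text = re.sub(r"\s+", " ", text.lower()).strip()
    if len(text) < 10:  # filter entries too short to classify meaningfully
        return None
    # Spell-correct tokens (fall back to the original word), then lemmatize.
    words = [spell.correction(w) or w for w in text.split()]
    return " ".join(tok.lemma_ for tok in nlp(" ".join(words)))
```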
Technical Details:
- Spark Optimization: DataFrames are partitioned by sentiment category to parallelize pre-processing across a Databricks cluster (e.g., 8 nodes, 32 cores); a sketch of this step appears below.
- Error Handling: Custom logging captures pre-processing errors (e.g., malformed UTF-8) for manual review.
Challenges:
- Balancing pre-processing thoroughness with performance was critical: over-aggressive cleaning risked losing contextual nuances, while under-cleaning led to noisy LLM inputs.
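Wired into the cluster, the partitioning strategy might look like the following; raw_df and the column names carry over from the hypothetical extraction sketch above, and in practice a pandas UDF or mapPartitions would likely outperform the plain UDF shown.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Repartition by sentiment so each category's rows are processed in parallel,
# then apply the clean_comment function from the previous sketch.
clean_udf = F.udf(clean_comment, StringType())

clean_df = (
    raw_df.repartition("SENTIMENT")
    .withColumn("CLEAN_TEXT", clean_udf(F.col("COMMENT_TEXT")))
    .filter(F.col("CLEAN_TEXT").isNotNull())  # drop rows filtered by cleaning
)
```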
3. Batch Inference with Groq and Meta Llama 3.3 70B Versatile
To categorize comments into 10 predefined feedback categories (e.g., cleanliness, staff behavior, amenities, booking experience), the pipeline leverages Meta’s Llama 3.3 70B model (served as llama-3.3-70b-versatile) via Groq’s inference API. The model’s efficiency and Groq’s low-latency infrastructure enable scalable batch processing.
Inference Workflow:
- Prompt Design:
  - Each comment is embedded in a structured prompt:
    Comment: {comment_text}
    Task: Assign the comment to one or more of the following categories: {category_list}. Provide reasoning for the classification and suggest new categories if applicable.
    Output format: {category: [list], reasoning: [text], new_categories: [list]}
  - Prompts are templated using Python’s string.Template for consistency.
- Batch Processing:
  - Comments are grouped into batches of ~500 records to optimize API throughput.
  - Asynchronous calls to Groq’s API are managed using Python’s aiohttp library, with a maximum concurrency of 10 requests to avoid rate limiting.
  - Retries (up to 3 attempts) with exponential backoff handle transient API timeouts.
- Error Handling:
  - Failed API calls are logged with comment IDs for reprocessing.
  - Rate-limit errors trigger a dynamic pause (e.g., 5 seconds) before retrying.
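A hedged sketch of this batching logic follows. The endpoint and model id match Groq’s OpenAI-compatible chat-completions API; the prompt wording mirrors the template above, while helper names, retry constants, and the response-parsing details are assumptions.

```python
import asyncio
import json
import os
from string import Template

import aiohttp

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"
MODEL = "llama-3.3-70b-versatile"

# string.Template avoids clashes with literal braces in the output format.
PROMPT = Template(
    "Comment: $comment\n"
    "Task: Assign the comment to one or more of the following categories: "
    "$categories. Provide reasoning for the classification and suggest new "
    "categories if applicable.\n"
    'Output format: {"category": [...], "reasoning": "...", "new_categories": [...]}'
)

async def classify(session, sem, comment, categories, retries=3):
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": PROMPT.substitute(
                comment=comment, categories=", ".join(categories)),
        }],
    }
    headers = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}
    for attempt in range(retries):
        try:
            async with sem:  # cap concurrency at 10 in-flight requests
                async with session.post(GROQ_URL, json=payload,
                                        headers=headers) as resp:
                    if resp.status == 429:   # rate-limited: pause, then retry
                        await asyncio.sleep(5)
                        continue
                    body = await resp.json()
                    return json.loads(body["choices"][0]["message"]["content"])
        except (aiohttp.ClientError, asyncio.TimeoutError):
            await asyncio.sleep(2 ** attempt)  # exponential backoff on timeouts
    return None  # caller logs the comment ID for reprocessing

async def run_batch(comments, categories):
    sem = asyncio.Semaphore(10)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(classify(session, sem, c, categories) for c in comments))
```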
Challenges:
- Prompt Sensitivity: Vague or compound comments (e.g., “The room was dirty but the staff was great”) required careful prompt tuning to avoid misclassification.
- API Costs: Balancing batch size and API call frequency was critical to manage costs while meeting processing deadlines.
4. Structuring and Storing Output
The LLM’s JSON-formatted responses are parsed into a structured format for storage and analysis:
- Binary Matrix: A 10-column matrix with 1/0 flags indicating category assignments.
- Reasoning Column: Free-text explanations of the model’s classification logic.
- New Category Suggestions: A column capturing novel categories proposed by the LLM (e.g., “pet policy” for comments about pet-related issues).
Technical Details:
- Parsing: JSON responses are validated using pydantic to ensure schema compliance.
- Storage: Structured data is written back to a Snowflake table using the snowflake-connector-python library, with partitioning by sentiment and location for query efficiency.
- Excel Export: A Pandas DataFrame is exported to Excel using openpyxl, with conditional formatting to highlight key categories for stakeholders.
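A minimal parsing sketch is below, assuming pydantic v2 and the output format from the prompt above; the model class, field names, and helper function are illustrative rather than the production schema. The resulting rows feed a Pandas DataFrame that write_pandas can push back to Snowflake.

```python
import pandas as pd
from pydantic import BaseModel, ValidationError
from snowflake.connector.pandas_tools import write_pandas

# Mirrors the prompt's output format: {category, reasoning, new_categories}.
class Classification(BaseModel):
    category: list[str]
    reasoning: str
    new_categories: list[str] = []

def to_row(raw_json: str, all_categories: list[str]) -> dict | None:
    try:
        parsed = Classification.model_validate_json(raw_json)
    except ValidationError:
        return None  # logged with the comment ID for reprocessing
    # Expand category assignments into 1/0 flags (the binary matrix).
    row = {cat: int(cat in parsed.category) for cat in all_categories}
    row["reasoning"] = parsed.reasoning
    row["new_categories"] = ", ".join(parsed.new_categories)
    return row

# rows -> DataFrame -> Snowflake; conn is an open snowflake.connector session
# and the table name is a placeholder.
# df = pd.DataFrame(filter(None, (to_row(r, CATEGORIES) for r in responses)))
# write_pandas(conn, df, "SURVEY_FEEDBACK_CLASSIFIED")
```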
Challenges:
- Schema Consistency: Ensuring LLM outputs adhered to the expected JSON format required robust error checking.
- Stakeholder Accessibility: Excel exports needed clear documentation to bridge technical outputs with business needs.
5. Validation and Feedback Loop
To validate LLM classifications:
- A random sample of ~1,000 comments across sentiment levels is manually reviewed.
- Discrepancies (e.g., “staff behavior” misclassified as “amenities”) are analyzed to identify patterns.
- A prompt adaptation mechanism dynamically rephrases inputs for low-confidence outputs (based on the model’s token probability scores), improving accuracy over iterations.
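The adaptation step might look like the following sketch; the threshold and rephrasing rules are hypothetical, and the confidence score is assumed to be derived upstream from token probabilities exposed by the serving layer.

```python
LOW_CONFIDENCE = 0.6  # illustrative threshold, not a tuned value

def adapt_prompt(comment: str, sentiment: str, confidence: float) -> str:
    """Rephrase the task when the previous classification was low-confidence."""
    base = (f"Comment: {comment}\n"
            "Task: Assign the comment to one or more of the predefined categories.")
    if confidence >= LOW_CONFIDENCE:
        return base
    # Steer the model toward a single dominant topic on uncertain inputs.
    hint = ("Focus on the primary complaint."
            if sentiment == "Unhappy"
            else "Identify the single most specific topic mentioned.")
    return f"{base}\n{hint}"
```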
Technical Details:
- Prompt Tuning: A Python script adjusts prompt phrasing (e.g., adding “focus on primary complaint” for Unhappy comments) based on validation feedback.
Challenges:
- Compound comments and sarcasm posed persistent challenges, requiring iterative prompt refinements.
6. Visualization of Insights
Processed data is aggregated by location and sentiment to generate bubble cloud charts using Python’s plotly. These visualizations highlight dominant issues (e.g., “cleanliness” for Unhappy comments) and are integrated into a Power BI dashboard for stakeholder access.
Technical Details:
- Aggregation: Spark SQL queries group data by location and category, computing frequency and sentiment distributions.
- Visualization: Plotly’s scatter function creates bubble charts, with bubble size proportional to issue frequency (a minimal sketch follows below).
Challenges:
- Ensuring visualizations were intuitive for non-technical stakeholders required iterative design feedback.
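A minimal Plotly sketch under these assumptions; the aggregated frame’s column names and the sample rows are purely illustrative, with the real input coming from the Spark SQL aggregation.

```python
import pandas as pd
import plotly.express as px

# Illustrative aggregated frame; real data comes from the Spark SQL step.
agg = pd.DataFrame({
    "location":  ["Downtown", "Downtown", "Airport"],
    "category":  ["cleanliness", "staff behavior", "booking experience"],
    "count":     [120, 45, 80],
    "sentiment": ["Unhappy", "Happy", "Neutral"],
})

fig = px.scatter(
    agg, x="location", y="category",
    size="count",        # bubble size ~ issue frequency
    color="sentiment",
    size_max=60,
)
fig.write_html("feedback_bubbles.html")  # artifact embedded in the dashboard
```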
7. Impact and Scalability
The workflow eliminates ~300 hours of manual tagging per quarter, enabling analysts to focus on strategic tasks. It also identifies novel pain points (e.g., booking system errors), driving targeted service improvements. Financially, enhanced customer retention and upselling opportunities are estimated to yield $10–15 million in annual revenue impact.
Scalability Features:
- Horizontal Scaling: Databricks clusters scale dynamically with data volume.
- API Optimization: Groq’s infrastructure supports up to 10,000 comments per hour with minimal latency.
- Cost Efficiency: Batch processing and retry logic minimize API costs.
8. Future Enhancements
Future iterations could include:
- In-House Model Fine-Tuning: Fine-tune a lightweight LLM (e.g., Qwen models) on domain-specific feedback to reduce API dependency.
- Embedding-Based Analysis: Generate comment embeddings using Sentence-BERT for clustering and recommendation engines.
- Real-Time Processing: Stream feedback via Apache Kafka for near-real-time insights.
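As a sense of what the embedding-based path might look like, the sketch below encodes comments with a Sentence-BERT model and clusters them; the model name, sample comments, and cluster count are illustrative.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Encode comments into dense vectors, then group similar feedback.
model = SentenceTransformer("all-MiniLM-L6-v2")
comments = ["The room was spotless.", "The booking site kept timing out."]
embeddings = model.encode(comments)

labels = KMeans(n_clusters=2, n_init="auto").fit_predict(embeddings)
```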
Final Thoughts
This workflow exemplifies the power of integrating LLMs with traditional data engineering tools. By leveraging Snowflake’s scalability, Databricks’ processing power, and Groq’s inference speed, the system delivers actionable insights at scale. The approach is adaptable to other industries with high-volume text data, offering a blueprint for modern sentiment analysis.