How Databricks can Unlock Advanced Analytics for Every Business User

Hi, I am Michael Derrico, and I am currently entering my senior year at Fordham University's Gabelli School of Business, where I am pursuing a major in Finance with a concentration in Financial Technology. This post dives into a project I designed and built with Rearc's guidance and support. The project sharpened my data skills and demonstrates that data analytics, when thoughtfully designed, doesn’t have to be complex to be effective.

Introduction

Understanding public sentiment around brands, products, and events in real time has traditionally required specialized data teams, complex infrastructure, and weeks of development time. This project breaks down those barriers by demonstrating how platforms like Databricks, combined with AI tools, can transform anyone into a capable data analyst. What once demanded deep technical expertise (scraping multiple social media platforms, applying machine learning models, and generating executive summaries) now becomes as simple as typing a search term and clicking run.

By building a sentiment analysis pipeline that ingests data from X/Twitter, Reddit, and news sources, this solution showcases the true power of democratized analytics. The heavy lifting happens behind the scenes through Databricks’ unified workspace, Hugging Face’s pretrained models, and OpenAI’s natural language generation capabilities, while the end user experience remains remarkably simple: enter a topic/brand of interest and receive comprehensive, AI-powered insights within minutes. This represents a fundamental shift in how organizations can approach data-driven decision making–moving from exclusive, expert-dependent processes to inclusive, self-service analytics that empower every team member to extract meaningful insights from complex, real-time data streams.

Project Requirements:

Goal: Build a scalable, repeatable workflow that can:

- Ingest and normalize unstructured data from Twitter, Reddit, and news outlets.
- Score and summarize public sentiment using both rule-based and AI/ML techniques.
- Consolidate results for queryable storage and dashboarding.
- Produce AI-generated summaries tailored for business users and non-technical stakeholders.

What the workflow must deliver:

- Channel-Agnostic Collection: Seamlessly gather posts, comments, and articles, regardless of platform limitations, so every relevant voice is captured in the same wide-reaching scrape.
- Built-in Deduplication: Keep each post unique even when the same topic is queried repeatedly, ensuring insight numbers always add up.
- One Source of Truth: Store scraped, enriched content in a single, governed table, so that when marketing, support, and strategy teams compare findings, they see identical facts.
- AI-Generated Summaries: Beyond retrieving rows of sentiment data, generate a summary of the key points in the collected data, so no one has to perform that task manually with advanced analytics skills.

Why Databricks is the Ideal Platform for This Project

- All Roles in One Workspace: Analysts, engineers, and business partners share the same notebook folders and SQL dashboards, making results accessible and actionable for technical and business users alike.
- Open, Reliable Storage: Delta Lake records every change made to stored data in its transaction log, providing full versioning that enables time travel and auditability.
- AI SQL Functions & Connected Tools: Databricks’ built-in SQL function ai_analyze_sentiment and its built-in connectors to Hugging Face machine learning models provide accurate sentiment analysis.
- Widely Used Product: Trusted by global enterprises like Shell, Mastercard, and Microsoft, Databricks powers data and AI workflows across industries, offering scalability, reliability, and strong community support.

Having established why Databricks provides the ideal foundation for this project, let's examine how the sentiment tracking system actually works. The following workflow breakdown demonstrates how raw social media data transforms into actionable business insights.

End-to-End Workflow Breakdown

Live Data Feeds:
- Lobstr scrapes X/Twitter for social buzz and live chatter.
- PRAW pulls threads from Reddit through the Reddit API.
- GNews scrapes relevant articles.
  All three are retrieved within a Databricks notebook where Delta Lake stores the data in a consistent, query-friendly structure.
Instant Sentiment & Enrichment:
- Databricks’ ai_analyze_sentiment assigns a sentiment label to each data point.
- A RoBERTa transformer from Hugging Face adds nuanced polarity scores, catching sarcasm and mixed emotions, and assigns each data point a score from -1 to +1.
One-Click Summaries:
- Rows from the data are assembled into a text prompt that is sent to OpenAI, distilling thousands of words into actionable, business-friendly insights, complete with a 1-10 sentiment score.
Friction-Free Delivery:
- Summaries can be easily pasted into a text box in a shared dashboard or sent directly via email, so any colleague, technical or not, can see the same insight moments after the query runs.
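The enrichment and summary steps above can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual code. It assumes the Hugging Face model returns probabilities for the labels "negative", "neutral", and "positive", and it collapses them into a -1 to +1 score as P(positive) minus P(negative). The function names, row fields, and prompt wording are all hypothetical, and the call to OpenAI itself is left out.

```python
from typing import Dict, List


def polarity_score(probs: Dict[str, float]) -> float:
    """Collapse class probabilities into a single score in [-1, 1].

    Assumption: probs holds probabilities for "negative", "neutral",
    and "positive" that sum to 1. The score is P(positive) - P(negative).
    """
    return probs.get("positive", 0.0) - probs.get("negative", 0.0)


def build_summary_prompt(query: str, rows: List[dict], max_rows: int = 50) -> str:
    """Turn scored rows into one text prompt for a summarization model."""
    lines = [
        f"- [{row['source']}] ({row['polarity']:+.2f}) {row['text']}"
        for row in rows[:max_rows]  # cap the prompt size
    ]
    return (
        f"Summarize public sentiment about '{query}' for a business audience.\n"
        "Finish with an overall sentiment score from 1 to 10.\n\n"
        + "\n".join(lines)
    )


# Hypothetical rows, as they might look after the enrichment step.
rows = [
    {"source": "reddit", "text": "Love the new release.",
     "polarity": polarity_score({"negative": 0.05, "neutral": 0.15, "positive": 0.80})},
    {"source": "news", "text": "Shares fell after the announcement.",
     "polarity": polarity_score({"negative": 0.70, "neutral": 0.20, "positive": 0.10})},
]
prompt = build_summary_prompt("ExampleBrand", rows)
print(prompt)
```

Capping the number of rows keeps the prompt within a predictable size, whatever the volume of scraped data.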

Processes Performed in Order:

[Figure: Workflow Breakdown]

Approach & Key Decisions

My approach prioritized simplicity and accessibility over technical complexity, proving that sophisticated analytics can be intuitive for any user. Rather than building custom infrastructure for each data source, I leveraged built-in and connected Databricks functions and tools to handle the complexity. Part of that complexity came from duplicates collected during scraping, which led me to use each data point’s URL as the primary key when loading new data. With URL-based deduplication through Delta Lake’s MERGE operation, repeated queries do not create duplicate records for users to clean up.
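To make the URL-keyed deduplication concrete, here is a small sketch. It assumes an insert-only merge, where rows whose URL already exists are left untouched. The table names are hypothetical. The first function builds a Delta Lake MERGE statement of the kind that could be passed to spark.sql in a Databricks notebook, and the second is a plain-Python stand-in that mimics the same behavior so the example runs anywhere.

```python
from typing import Dict, Iterable


def build_merge_sql(target: str, source: str, key: str = "url") -> str:
    """Return a Delta Lake MERGE statement that inserts only unseen keys."""
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


def merge_by_url(table: Dict[str, dict], batch: Iterable[dict]) -> int:
    """In-memory stand-in for the MERGE above: insert rows whose URL is new."""
    inserted = 0
    for row in batch:
        if row["url"] not in table:
            table[row["url"]] = row
            inserted += 1
    return inserted


# Two scrapes of the same topic that overlap on one post.
table: Dict[str, dict] = {}
first_run = [
    {"url": "https://example.com/a", "text": "first post"},
    {"url": "https://example.com/b", "text": "second post"},
]
second_run = [
    {"url": "https://example.com/b", "text": "second post"},
    {"url": "https://example.com/c", "text": "third post"},
]
inserted_first = merge_by_url(table, first_run)
inserted_second = merge_by_url(table, second_run)
print(inserted_first, inserted_second, len(table))

# Hypothetical table names; in a notebook this string could go to spark.sql.
merge_sql = build_merge_sql("sentiment.posts", "new_posts")
print(merge_sql)
```

Because the merge only inserts rows with unseen URLs, running the same query twice leaves the table unchanged the second time.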

When I first developed the code and the idea as a whole, I was using a Python-based sentiment analysis tool from NLTK (the Natural Language Toolkit). Its scores were inconsistent and, in general, not accurate, so I knew that if I wanted accurate numerical sentiment values, I would need to use an LLM (large language model). In the end, I chose a pre-trained sentiment model from Hugging Face rather than taking the traditional route of building a custom model on a large enough dataset. Choosing the pre-trained model not only saved development time but also demonstrates how democratized AI puts enterprise-grade machine learning within reach of business analysts who have never written Python code. Finally, the schema is standardized automatically across Reddit, X/Twitter, and news sources, while the original records are also saved, so users focus on data insights rather than wrestling with data formatting. This architectural philosophy, hiding complexity while preserving power, exemplifies how modern data platforms transform analytics from a specialized discipline into an everyday business capability.
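As a rough illustration of the schema standardization described above, the sketch below maps records from each source onto one shared set of columns and keeps the original payload alongside. The raw field names in FIELD_MAP and the shared column names are assumptions made for this example, not the project's actual schema.

```python
import json

# Hypothetical mapping from shared column names to each source's raw field names.
FIELD_MAP = {
    "reddit": {"url": "permalink", "text": "title", "published_at": "created"},
    "twitter": {"url": "tweet_url", "text": "content", "published_at": "timestamp"},
    "news": {"url": "link", "text": "headline", "published_at": "published"},
}


def normalize(source: str, raw: dict, query: str) -> dict:
    """Map one raw record onto the shared schema, keeping the original form."""
    mapping = FIELD_MAP[source]
    record = {column: raw[raw_key] for column, raw_key in mapping.items()}
    record["source"] = source
    record["query"] = query  # label of the search that produced this row
    record["raw_json"] = json.dumps(raw, sort_keys=True)  # original payload
    return record


reddit_row = normalize(
    "reddit",
    {"permalink": "https://example.com/r/1", "title": "Great update",
     "created": "2025-01-05T12:00:00Z", "ups": 12},
    query="ExampleBrand",
)
news_row = normalize(
    "news",
    {"link": "https://example.com/n/1", "headline": "Brand launches product",
     "published": "2025-01-06T08:30:00Z"},
    query="ExampleBrand",
)
print(sorted(reddit_row) == sorted(news_row))
```

Storing the untouched payload in one column means a later change to the shared schema can be replayed from the originals without re-scraping.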

Challenges Encountered:

- Repeatable, Idempotent Runs: The pipeline was engineered to support unlimited re-runs with different queries or time windows, adding only new, unseen posts.
- Tracking Queries & Freshness: Every record contains a query label and publish timestamp, enabling historic/backfill analysis and trend tracking.
- Twitter API Constraints: Ongoing challenges include handling rate limits and adapting scraping methods as Twitter’s APIs and policies change. The shift to the Lobstr.io API was crucial because it is one of the few Twitter wrappers that supports scraping Twitter search results pages rather than only individual posts or user profiles. Additionally, my experience with the Lobstr.io API has been great, and the support they have provided has been wonderful.
- Unstructured → Structured: Standardizing unstructured text from vastly different platforms proved non-trivial, but was solved through schema design and validation.
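The query label and publish timestamp make trend tracking a simple grouping exercise. The sketch below is illustrative only: it assumes each record carries a query string, an ISO 8601 published_at string, and a polarity score between -1 and +1, and these field names are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def daily_average(records: List[dict]) -> Dict[Tuple[str, str], float]:
    """Average polarity per (query, publish day)."""
    buckets = defaultdict(list)
    for record in records:
        day = record["published_at"][:10]  # "YYYY-MM-DD" prefix of the timestamp
        buckets[(record["query"], day)].append(record["polarity"])
    return {key: mean(values) for key, values in sorted(buckets.items())}


records = [
    {"query": "ExampleBrand", "published_at": "2025-01-05T09:00:00Z", "polarity": 0.5},
    {"query": "ExampleBrand", "published_at": "2025-01-05T17:30:00Z", "polarity": -0.5},
    {"query": "ExampleBrand", "published_at": "2025-01-06T11:15:00Z", "polarity": 0.25},
]
trend = daily_average(records)
for (query, day), score in trend.items():
    print(query, day, score)
```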

Possible Next Steps

Future enhancements will further demonstrate how democratized analytics can evolve with minimal technical overhead, making advanced capabilities accessible to even broader audiences. Automated weekly runs would keep the data and dashboards up to date on public sentiment. They could be set up by scheduling the notebooks to run with a constant query through Databricks’ point-and-click interface, allowing marketing teams or product managers to set up brand tracking without data engineer involvement. From there, a further enhancement would be an alert and AI summary that notify the user when sentiment shifts dramatically. Finally, with more technical prowess, you can expand the data sources to platforms like YouTube or specific consumer review sites (like Consumer Reports, J.D. Power, and CNET) to perform more sophisticated product and brand sentiment analysis rather than general public sentiment. These incremental improvements underscore the project’s core thesis: as data tools become more intuitive and AI-assisted, the gap between having a question and finding an answer continues to shrink, ultimately putting sophisticated analytics capabilities into the hands of anyone curious enough to ask.
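As one possible shape for the sentiment-shift alert mentioned above, the sketch below compares each week's average score with the previous week's and flags large swings. The threshold of 0.3 and the weekly averages are made-up values for illustration, and a real alert would also need a delivery channel such as email.

```python
from typing import List, Tuple


def detect_shifts(weekly_scores: List[float], threshold: float = 0.3) -> List[Tuple[int, float]]:
    """Return (week index, change) for week-over-week moves of at least threshold."""
    alerts = []
    for week in range(1, len(weekly_scores)):
        change = weekly_scores[week] - weekly_scores[week - 1]
        if abs(change) >= threshold:
            alerts.append((week, round(change, 2)))
    return alerts


# Hypothetical weekly average polarity scores for one tracked query.
weekly_scores = [0.40, 0.35, -0.10, -0.05]
alerts = detect_shifts(weekly_scores)
print(alerts)
```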
