Meetup Summary: From Traditional ETL to Governed AI on Databricks

February 21st, 2026

Introduction

The latest Northern Virginia Software Architecture Roundtable (NOVA SART) meetup, hosted by Solution Street, featured a deep dive into one of the most consequential shifts happening in enterprise data engineering today: the migration from traditional ETL platforms to modern, AI-ready architectures built on platforms such as Databricks. Hirenkumar Dholariya, a senior data engineering architect with over 18 years of experience modernizing enterprise data platforms for Fortune 500 organizations, walked attendees through both the foundational reasons this transition matters and a compelling real-world use case that brought the concepts to life: an AI-powered virtual assistant for field service engineers in the healthcare and life sciences industry.

The presentation made clear that the move away from legacy ETL isn’t just about better tooling. It’s about building a governed, unified data foundation that can support the AI workloads enterprises are now racing to deploy.

Why Traditional ETL Falls Short

Hiren opened by framing the challenge through the lens of the familiar “three Vs” of modern data: variety, velocity, and volume. Traditional ETL tools were designed for a world of structured, batch-oriented data: nightly jobs loading text records into relational databases like Oracle or SQL Server. That world has changed dramatically. Today, the majority of enterprise data is unstructured (images, videos, JSON, PDFs), and organizations need near real-time processing, not overnight batch runs.

The limitations compound quickly. Running legacy ETL jobs more frequently to approximate real-time ingestion demands more resources and drives up costs without ever truly achieving the responsiveness modern applications require. Worse, traditional tools lack the governance capabilities enterprises now need: full audit trails, data lineage, fine-grained access controls, and automated compliance. As Hiren noted, even if you pour money into the old approach, trust in the data remains elusive because the tooling simply wasn’t designed to provide it.

He punctuated the point with a Gartner prediction: 80% of enterprises will adopt generative AI by 2026, but many will struggle to realize a return on investment because of persistent data integration gaps. The data, Hiren argued, is the foundation, and if that foundation is fragmented across siloed ETL jobs pulling from SAP, Salesforce, and dozens of other sources, building reliable AI on top of it becomes prohibitively expensive and slow.

Databricks: A Unified Platform for Data, Analytics, and AI

Databricks addresses these challenges by providing a single platform that spans data engineering, analytics, and AI. Rather than maintaining separate tools for ETL, BI, and machine learning, teams can operate within one environment that supports both batch and streaming workloads on the same data.

Hiren shared performance numbers from his own projects: jobs that previously ran on legacy ETL tools were running 10 to 40 times faster on Databricks, translating to more than 50% cost reduction on ETL pipeline operations alone. The platform’s auto-scaling cluster architecture, built on the Photon engine, means compute resources expand and contract with data volume automatically, so teams aren’t paying for idle capacity or scrambling when data spikes hit.

A key advantage is the open format approach. Data is stored as Delta Lake tables in Parquet format on cloud object storage (S3 in Hiren’s case), eliminating vendor lock-in. Unlike traditional architectures where data sits inside proprietary database engines with escalating license costs, the open format ensures portability and long-term cost control.

Unity Catalog: Governance at Scale

Central to Databricks’ architecture is Unity Catalog, the governance layer. Unity Catalog provides a three-tier organizational structure (catalog, schema, and tables/views) that maps naturally to how enterprises organize projects and data domains. It delivers several critical capabilities that traditional ETL environments lack.

1. Cross-cloud, cross-region governance with a single source of truth, meaning the same governance policies apply whether data lives in AWS, Azure, or Google Cloud.

2. Fine-grained security through row-level filters and column-level masking, going well beyond the table-level grants available in traditional databases.

3. Automatic data lineage: when a table is used in a pipeline or materialized view, those dependencies are tracked and visible without any manual documentation effort.

4. Automated PII scanning and data classification, newer features that can detect and mask sensitive fields such as Social Security numbers or phone numbers based on the data itself.
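The detect-and-mask idea can be sketched in a few lines of Python. This is not Unity Catalog's implementation; the patterns, labels, and masking token below are purely illustrative:

```python
import re

# Hypothetical patterns for two common PII types; real classifiers
# in governance platforms are far more sophisticated than this.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def classify_and_mask(value: str) -> tuple[list[str], str]:
    """Return the detected PII classes and a masked copy of the value."""
    found = []
    masked = value
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(masked):
            found.append(label)
            masked = pattern.sub("***MASKED***", masked)
    return found, masked
```

The point of the sketch is the shape of the feature: classification happens on the data itself, so a sensitive column is caught even when its name gives nothing away.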

Hiren demonstrated this with examples from his project, showing how clicking on any table in Unity Catalog reveals its full dependency graph (upstream sources, downstream consumers, access permissions, modification history, and sample data) all in one place. For teams that have experienced the pain of someone unexpectedly modifying or deleting critical data, this level of visibility and auditability represents a fundamental improvement.

Real-World Implementation: An AI Virtual Assistant for Healthcare Field Service

The second half of the presentation showcased a concrete AI application built on the Databricks and Unity Catalog foundation: a service agent virtual assistant designed for a healthcare and life sciences company. In this industry, every minute of instrument downtime matters because it can directly or indirectly impact patient health.

The Problem: The company’s field service engineers resolve thousands of customer tickets per year for scientific and medical instruments deployed globally. To resolve a single ticket, an engineer often needed to spend two to three hours sifting through historical ticket data, service notes, part replacement records, and troubleshooting guides scattered across multiple systems. Hiren illustrated the cost of this inefficiency with a vivid scenario: a field service engineer travels 400 miles to a hospital site, only to discover that another engineer had already visited and replaced a filter the previous week, yet the underlying problem persists. The wasted trip costs the company in travel expenses and warranty labor while the customer’s frustration (and potentially patient impact) compounds.

The Solution Architecture: The team built an AI virtual assistant on top of Databricks and Unity Catalog. The architecture works as follows: data from multiple source systems (ticketing platforms, customer databases, cost tracking, and service histories) is ingested into an Enterprise Data Platform and organized within Unity Catalog. Materialized views are created to join and filter only the relevant fields across normalized tables (ticket descriptions, engineer comments, customer feedback, part costs, labor hours).
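As a rough illustration of what those materialized views accomplish, here is a minimal Python sketch that joins hypothetical normalized tables into one flat record per ticket. The field names are invented for this example; in the actual project this shaping is done as materialized views inside Databricks, not in application code:

```python
def build_ticket_view(tickets, comments, costs):
    """Join normalized source tables into one flat record per ticket,
    keeping only the fields the downstream AI pipeline needs."""
    comments_by_ticket = {}
    for c in comments:
        comments_by_ticket.setdefault(c["ticket_id"], []).append(c["text"])
    costs_by_ticket = {c["ticket_id"]: c for c in costs}

    view = []
    for t in tickets:
        cost = costs_by_ticket.get(t["id"], {})
        view.append({
            "ticket_id": t["id"],
            "description": t["description"],
            "engineer_comments": " ".join(comments_by_ticket.get(t["id"], [])),
            "part_cost": cost.get("part_cost", 0.0),
            "labor_hours": cost.get("labor_hours", 0.0),
        })
    return view
```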

Because the company operates globally, ticket data arrives in many languages. The first processing step uses OpenAI libraries to translate all text into English. A summarization step then condenses multiple text fields per ticket into a single coherent summary. The summarized data is then vectorized using Databricks’ built-in Vector Search Index capability, which creates a vector database table within Unity Catalog. Approximately 400,000 historical tickets were processed this way, along with 140,000 PDF product manuals that were chunked, embedded, and stored in a separate vector table.
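The chunking step for the PDF manuals can be sketched as a simple overlapping window over extracted text. The window and overlap sizes below are illustrative defaults, not the project's actual settings:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split extracted manual text into overlapping character windows
    before embedding. Overlap keeps a sentence that straddles a chunk
    boundary retrievable from both neighboring chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks
```

Each chunk is then embedded and written to the vector table, so a query can match a specific passage of a manual rather than the whole document.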

When a field service engineer encounters a problem, they open the assistant’s interface and type a natural language query, such as “pressure fluctuation.” The system vectorizes the query, performs a similarity search against the vector database, and returns relevant results within minutes rather than hours. The engineer can choose to search the knowledge base (PDF manuals), the ticket history, or both. Results from the ticket history are ranked by cost efficiency, meaning tickets where the issue was resolved with the fewest parts and least labor appear first. The engineer can drill into individual tickets for detailed step-by-step resolution notes, parts used, and labor costs, and can ask follow-up questions in natural language.
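The retrieve-then-rank flow can be approximated in plain Python: score tickets by vector similarity, keep the top hits, then reorder them by total resolution cost so the cheapest fixes surface first. The cosine helper and the `hourly_rate` parameter are stand-ins for the real embedding model and cost data:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def search_tickets(query_vec, tickets, top_k=5, hourly_rate=100.0):
    """Rank tickets by similarity to the query, keep the top_k hits,
    then order those hits by estimated total resolution cost."""
    by_similarity = sorted(
        tickets, key=lambda t: cosine(query_vec, t["vec"]), reverse=True
    )
    hits = by_similarity[:top_k]
    return sorted(hits, key=lambda t: t["part_cost"] + t["labor_hours"] * hourly_rate)
```

In production this two-stage shape (semantic retrieval followed by a business-metric re-rank) is what puts the fewest-parts, least-labor resolutions at the top of the list.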

Critically, the system is designed to avoid hallucination. Because the data source is constrained to internal ticket history and product manuals, not the open web, the assistant only returns information that exists in the knowledge base. If no relevant history is found, it simply reports that rather than fabricating an answer.
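This "answer only from the knowledge base" behavior amounts to a similarity threshold on the retrieved results. A minimal sketch, with an invented cutoff value rather than the production one:

```python
NO_MATCH_THRESHOLD = 0.75  # illustrative cutoff, not the production value

def grounded_answer(hits):
    """hits: (similarity, record) pairs from the vector search.
    Refuse to answer when nothing internal is similar enough,
    rather than letting the model fabricate a fix."""
    relevant = [record for score, record in hits if score >= NO_MATCH_THRESHOLD]
    if not relevant:
        return None, "No relevant history found in the knowledge base."
    return relevant, None
```

Because generation is conditioned only on what survives this filter, an empty result produces an honest "nothing found" rather than a hallucinated repair procedure.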

Results: In roughly 14 months of production use, the assistant has delivered measurable impact. Ticket resolution time improved by 70%. Average labor costs per ticket are trending downward. Part recommendation accuracy is steadily increasing. Repeat site visits, where an engineer must return because the first visit didn’t resolve the issue, have dropped significantly. Over 1,500 field service engineers and more than 50 international customers are actively using the tool, with over 5,000 queries processed daily. The system also addresses institutional knowledge loss: when experienced engineers retire, decades of accumulated troubleshooting knowledge are preserved in the system rather than walking out the door.

What’s Next: The team’s roadmap extends through three phases. Phase one (complete) is the reactive support tool described above. Phase two (in progress) adds predictive maintenance, using historical patterns to proactively recommend part replacements or warranty extensions during service visits, potentially saving hundreds of dollars per incident. Phase three expands into enterprise marketing support, enabling sales teams to identify upselling opportunities based on which customers are using older instrument models and how much they’ve spent on maintenance.

Five Key Takeaways

1. AI starts with the data foundation, not the model. The most important message from Hiren’s talk was that AI applications are only as good as the data foundation underneath them. Without governed, unified, high-quality data, even the most sophisticated models will underperform. Teams should invest in their data architecture before racing to build AI features.

2. Legacy ETL tools are a bottleneck for AI adoption. Traditional tools were built for a different era. They struggle with unstructured data, can’t support real-time processing cost-effectively, and lack the governance features modern AI workloads demand. Migrating to a platform like Databricks isn’t just a modernization exercise; it’s a prerequisite for meaningful AI deployment.

3. Governance isn’t optional; it’s what makes AI trustworthy. Unity Catalog’s fine-grained security, automatic lineage tracking, and PII detection aren’t just nice-to-have compliance features. In regulated industries like healthcare and life sciences, they’re essential. And even outside regulated environments, data trust is what separates AI experiments from production systems.

4. Constraining your AI’s data source is the most effective way to prevent hallucination. By limiting the assistant’s knowledge to internal ticket history and product manuals, rather than the open web, the team achieved high accuracy without complex guardrailing. This pattern of RAG over curated internal data is broadly applicable.

5. AI dramatically compresses time-to-value when built on the right platform. The initial data processing (translating, summarizing, and vectorizing 400,000 tickets and 140,000 PDFs) took 40+ hours as a one-time effort. After that, incremental updates run automatically and are fast. A tool that would have been unthinkable five years ago is now in daily production use by over 1,500 engineers, delivering 70% faster resolution times. The economics of building AI solutions have fundamentally shifted.

Solution Street is a Software Engineering and Consulting company. Reach out today to learn how AI can streamline your operations and deliver measurable efficiency gains!