Overview
Databricks — the unified data and AI platform founded by the creators of Apache Spark, valued at $43B as a private company.
When to Load This Skill
- User asks about Databricks history, Apache Spark, or data/AI platforms
- User needs analysis of Databricks vs. Snowflake competition or the Lakehouse architecture
- User asks about the MosaicML acquisition, the DBRX model, or the AI data infrastructure market
Historical Timeline
- 2013: Ali Ghodsi, Matei Zaharia (Spark creator), Ion Stoica, and other members of UC Berkeley's AMPLab found Databricks in Berkeley
- 2013: Spark, open source since 2010, is donated to the Apache Software Foundation and goes on to become the dominant big data processing engine
- 2019: Introduces Delta Lake — open-source storage layer bringing ACID transactions to data lakes
- 2020: Introduces the Lakehouse architecture, which unifies the data warehouse and the data lake
- 2021: Revenue passes $500M; valued at $38B
- 2023: Acquires MosaicML ($1.3B) and enters generative AI model training; raises a Series I round at a $43B valuation
- 2024: Launches DBRX (open LLM); revenue ~$2B+
Business Model
Platform-as-a-Service with consumption-based pricing: customers pay for what they use across compute (Databricks Runtime), storage (Delta Lake tables), and AI/ML services. Unity Catalog provides governance. The company is expanding from its data engineering base into BI, AI/ML, and governance.
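As a rough illustration of how consumption-based pricing adds up, the sketch below totals a bill from usage. Databricks meters compute in Databricks Units (DBUs); the workload names, DBU burn rates, and per-DBU prices used here are hypothetical placeholders rather than published rates, and the underlying cloud infrastructure charges are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """One metered workload in a billing period (all values hypothetical)."""
    name: str
    dbus_per_hour: float  # assumed DBU burn rate of the cluster
    hours: float          # hours the cluster ran in the period
    usd_per_dbu: float    # assumed price per DBU for this workload type


def workload_cost(w: Workload) -> float:
    # Consumption pricing: cost scales with usage, not with a fixed license.
    return w.dbus_per_hour * w.hours * w.usd_per_dbu


def monthly_bill(workloads: list[Workload]) -> dict[str, float]:
    # Per-workload cost in USD, rounded to cents.
    return {w.name: round(workload_cost(w), 2) for w in workloads}


if __name__ == "__main__":
    usage = [
        Workload("etl", dbus_per_hour=8.0, hours=100.0, usd_per_dbu=0.25),
        Workload("sql", dbus_per_hour=4.0, hours=50.0, usd_per_dbu=0.5),
    ]
    bill = monthly_bill(usage)
    for name, cost in bill.items():
        print(f"{name}: ${cost:,.2f}")
    print(f"total: ${sum(bill.values()):,.2f}")
```

With these made-up inputs, the ETL workload costs 8 x 100 x 0.25 = $200 and the SQL workload 4 x 50 x 0.5 = $100, for a $300 total. The point is only that revenue per customer tracks usage, so a customer who idles a cluster pays nothing for it.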
Competitive Moat
- Apache Spark originators: deep technical authority and community influence
- Delta Lake ecosystem: open-source standard that competitors must support
- Lakehouse architecture: unifies data engineering, analytics, and AI — one platform instead of multiple tools
- MosaicML acquisition: vertical integration from data infrastructure to model training
- Open-source strategy: Spark, Delta Lake, and MLflow drive developer adoption and community advocacy, which makes the platform the default choice for teams already built on those tools
Key Data
Valuation: $43B (private, 2023 Series I round) | Revenue: ~$2B+ (2024) | Customers: 10,000+ | Spark users: 1M+ developers | Employees: 7,000+
Interesting Facts
- Apache Spark began in 2009 as a research project at UC Berkeley's AMPLab; the paper was reportedly rejected from two conferences before Spark became the most popular big data framework
- Databricks is reportedly named after the fictional 'databrick' unit the founders jokingly used to measure Spark cluster processing capacity