
The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI

Astronomer

98 episodes

  • Introducing Airflow 3.2

    09/04/2026 | 26 mins.
    We introduce Airflow 3.2 and its updates for teams that build and operate data pipelines.
    Astronomer’s Head of Customer Education, Marc Lamberti, and Senior Manager of Developer Relations, Kenten Danas, break down what’s new, from asset partitioning to Async Python tasks and DAG versioning. They explore how these updates improve scheduling, performance and observability in production workflows.

    Key Takeaways:

    00:00 Introduction.
    02:10 Airflow 3 architecture separates workers from the metadata database.
    03:05 Plugin versioning and UI-based backfills simplify operations.
    06:20 Asset partitioning enables granular, partition-level scheduling.
    07:15 Triggering DAGs on partitions instead of full datasets.
    11:05 Deferrable operators reduce worker slot usage.
    12:00 Async operators reduce database pressure and overhead.
    14:10 Async improves throughput, not single task speed.
    22:20 Inlets and outlets improve asset lineage visibility.
    23:00 DAG version markers show changes directly in the UI.
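    The takeaways above note that async execution improves throughput rather than single-task speed. The principle can be illustrated with plain Python asyncio (this is an analogy, not Airflow's internal implementation): many tasks that are merely waiting on external systems can overlap on one thread instead of each occupying a worker slot.

    ```python
    import asyncio
    import time

    # Stand-in for a task that mostly waits on an external system --
    # the kind of work a deferrable/async operator handles without
    # holding a worker slot for the duration of the wait.
    async def wait_for_external(i: int) -> int:
        await asyncio.sleep(0.1)  # each "task" waits 100 ms
        return i

    async def main() -> float:
        start = time.perf_counter()
        # 50 waits overlap on a single thread.
        await asyncio.gather(*(wait_for_external(i) for i in range(50)))
        return time.perf_counter() - start

    elapsed = asyncio.run(main())
    # Total wall time stays close to one wait (~0.1 s); running the
    # same 50 waits sequentially would take about 5 s. Each individual
    # wait is no faster -- only aggregate throughput improves.
    print(f"{elapsed:.2f}s")
    ```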

    Resources Mentioned:

    Marc Lamberti
    https://www.linkedin.com/in/marclamberti/

    Apache Airflow
    https://airflow.apache.org/

    Astronomer | LinkedIn
    https://www.linkedin.com/company/astronomer/

    Astronomer | Website
    https://www.astronomer.io/

    3.2 Webinar
    https://www.astronomer.io/events/webinars/introducing-airflow-3-2-video

    Asset Partitioning Guide
    https://www.astronomer.io/docs/learn/airflow-partitioned-runs

    Asynchronous Processes Guide
    https://www.astronomer.io/docs/learn/deferrable-operators

    Release Notes
    https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html#airflow-3-2-0-2026-04-07

    Provider Registry
    https://airflow.apache.org/registry/

    Thanks for listening to “The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI.” If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.

    #AI #Automation #Airflow #MachineLearning
  • Reflections on a Decade of Data Engineering at Seattle Data Guy

    03/04/2026 | 26 mins.
    Lessons from the past decade of data engineering reveal how much the ecosystem has changed and what has stayed surprisingly consistent.

    In this episode, Benjamin Rogojan, Owner and Data Consultant at Seattle Data Guy, joins us to reflect on how the data engineering landscape has evolved alongside Apache Airflow. We explore when Airflow makes sense as an orchestrator, why batch processing is still dominant and how AI is reshaping the workflows and responsibilities of modern data engineers.

    Key Takeaways:

    00:00 Introduction.
    03:00 Airflow becomes valuable when workflows involve many pipelines, teams and dependencies.
    05:00 Data engineers are still focused on making data accessible and aligning work with business needs.
    05:30 Batch pipelines remain the most common approach even as real-time use cases grow.
    07:45 Many “real-time” requests are actually event-driven batch workflows.
    09:00 Airflow replaced many custom-built pipeline systems with built-in dependency management.
    11:00 Modern orchestration tools often build on Airflow concepts or differentiate from them.
    14:00 AI can assist with writing SQL and pipelines but still requires experienced engineers.
    15:30 Organizations are collecting increasingly granular data, creating more engineering demand.
    19:00 The data stack has shifted rapidly from Hadoop-era systems to modern cloud platforms.

    Resources Mentioned:

    Benjamin Rogojan
    https://www.linkedin.com/in/benjaminrogojan/

    Seattle Data Guy
    https://www.linkedin.com/company/seattle-data-guy/

    Apache Airflow
    https://airflow.apache.org

    Airflow Summit / Airflow Conference
    https://airflowsummit.org

    Snowflake
    https://www.snowflake.com

    HubSpot Data Sharing / APIs
    https://developers.hubspot.com

    MLflow
    https://mlflow.org


    #AI #Automation #Airflow
  • Managing Data Quality and Governance With Airflow at Credit Karma with Ashir Alam

    26/03/2026 | 22 mins.
    Data quality is not optional when you manage credit data at scale.

    In this episode, Ashir Alam, Senior Data Engineer at Credit Karma, joins us to share how his team acts as the gatekeeper for credit data ingestion, how they standardize data quality with Airflow and DAG Factory and how they scale safely across thousands of DAGs. We explore how governance, PII protection and orchestration come together inside a modern data platform.
    
    Key Takeaways:

    00:00 Introduction.
    01:00 Overview of Credit Karma’s products and financial data ecosystem.
    02:00 The team acts as gatekeepers for ingesting data from TransUnion and Equifax.
    03:00 Why PII handling and controlled downstream access led to adopting Airflow.
    04:00 BigQuery as the warehouse and Airflow as the primary orchestrator.
    05:00 Why data quality and governance are critical in financial systems.
    07:00 Why Airflow was selected: ease of use and unified ETL plus data quality.
    09:00 Introduction to DAG Factory and YAML-based DAG generation.
    10:00 GitHub executor creates PR-driven DAG workflows with CI checks.
    12:00 BigQuery operators, structured checks and custom Slack and PagerDuty alerts.
    13:00 Failed checks stop ETL pipelines and trigger notifications.
    17:00 Scaling DAG Factory across thousands of DAGs and runtime vs compile-time concerns.
    19:00 Future improvements: better defaults, retries and GenAI workflows in Airflow.
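    The YAML-based DAG generation described above follows the open-source DAG Factory pattern: DAGs are declared as configuration rather than Python code, which makes them easy to standardize and review in pull requests. A minimal sketch of such a definition (DAG and task names here are illustrative, not Credit Karma's actual pipelines):

    ```yaml
    # Illustrative DAG Factory-style definition; names are hypothetical.
    ingest_credit_data:
      default_args:
        owner: "data-platform"
        retries: 2
      schedule_interval: "@daily"
      tasks:
        extract:
          operator: airflow.operators.bash.BashOperator
          bash_command: "echo extract"
        quality_check:
          operator: airflow.operators.bash.BashOperator
          bash_command: "echo validate"
          dependencies: [extract]
    ```

    Because the definition is declarative, a CI check can lint every file in a pull request before the DAG ever reaches the scheduler, which is what makes the PR-driven workflow described at 10:00 practical at the scale of thousands of DAGs.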

    Resources Mentioned:

    Ashir Alam
    https://www.linkedin.com/in/ashir-alam/

    Credit Karma
    https://www.linkedin.com/company/intuit-credit-karma/

    Apache Airflow
    https://airflow.apache.org/

    DAG Factory
    https://github.com/astronomer/dag-factory

    BigQuery (Google Cloud)
    https://cloud.google.com/bigquery

    GitHub
    https://github.com/

    Slack
    https://slack.com/

    PagerDuty
    https://www.pagerduty.com/


    #AI #Automation #Airflow
  • Open Source Airflow Contributions and Performance Improvements at G-Research with Christos Bisias

    19/03/2026 | 17 mins.
    Modern Airflow isn’t just an orchestration tool; it’s also a community contribution.
    
    In this episode, we explore how open source investment drives real performance gains and deeper observability.

    We’re joined by Christos Bisias, Open Source Software Engineer, Apache Airflow at G-Research, to discuss how his team uses Airflow for large-scale data transformations, contributes upstream and improves scheduler throughput and OpenTelemetry support. From trace-level observability to CI-enforced metrics governance and a major scheduler optimization, this conversation spans strategy, engineering and community impact.

    Key Takeaways:

    00:00 Introduction.
    01:20 How G-Research applies machine learning and big data to predict financial market movements.
    02:15 Contributing to open source is a business decision.
    03:10 Maintaining a fork is costly.
    04:30 OpenTelemetry collects metrics, logs and traces to provide deep system visibility.
    06:10 Custom spans help identify bottlenecks inside tasks and enable performance optimization.
    08:05 OpenTelemetry integration works properly in Airflow 3.0 and above.
    10:00 A YAML-based metrics registry with CI enforcement ensures consistency between docs and exported metrics.
    12:10 Scheduler throughput improved significantly by applying concurrency limits earlier in the database query.
    15:20 Future Task SDK changes may enable language-agnostic DAG authoring beyond Python.
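    The metrics registry mentioned at 10:00 pairs a declarative catalog of exported metrics with a CI check that fails when docs and code drift apart. A hypothetical sketch of what such a registry entry might look like (the field names and metrics are illustrative, not G-Research's actual schema):

    ```yaml
    # Hypothetical metrics registry; CI verifies every exported metric
    # appears here with a matching type and description.
    - name: scheduler.tasks.scheduled
      type: counter
      description: "Tasks moved to the scheduled state per scheduler loop."
      labels: [dag_id]
    - name: scheduler.loop.duration
      type: timer
      description: "Wall-clock duration of one scheduler loop."
    ```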

    Resources Mentioned:

    Christos Bisias
    https://www.linkedin.com/in/xbis/

    G-Research
    https://www.linkedin.com/company/g-research/

    Apache Airflow
    https://airflow.apache.org/

    OpenTelemetry
    https://opentelemetry.io/

    Prometheus
    https://prometheus.io/

    Grafana
    https://grafana.com/

    Jaeger
    https://www.jaegertracing.io/


    #AI #Automation #Airflow
  • Automating Threat Intelligence Using Airflow with Karan Alang

    12/03/2026 | 22 mins.
    In this episode, Karan Alang, Principal Software Engineer at Versa Networks, joins the conversation to discuss how Airflow can be used to automate threat intelligence in modern cybersecurity environments. He explains the growing scale of cloud computing, the profitability of hacking and the shortage of SOC analysts. Karan also outlines a novel architecture that combines Airflow, XDR, graph databases and LLMs to orchestrate automated threat detection and response.

    Key Takeaways:

    00:00 Introduction.
    05:00 Organizations face massive log volumes and a shortage of SOC analysts.
    07:00 The solution integrates Airflow, XDR, Neo4j graph databases and LLMs into one architecture.
    08:00 MITRE ATT&CK provides a global framework for mapping tactics and techniques.
    11:00 Airflow acts as the orchestration backbone for ingestion, graph transformation and LLM workflows.
    13:00 Graph databases provide a full relationship view of attackers’ systems and entities.
    14:00 LLMs automate mapping activity to MITRE ATT&CK and assign explainable risk scores.
    17:00 Traditional signature-based detection allows lateral movement and exfiltration before teams can react.
    18:00 End-to-end automation is essential to mitigating modern cybersecurity threats.
    20:00 Future opportunities include deeper LLM integration as first-class citizens within Airflow.

    Resources Mentioned:

    Karan Alang
    https://www.linkedin.com/in/karan-alang-4173437

    Versa Networks | LinkedIn
    https://www.linkedin.com/company/versa-networks

    Versa Networks | Website
    https://versa-networks.com

    Google Cloud Composer (Managed Airflow on GCP)
    https://cloud.google.com/composer

    Microsoft Defender XDR
    https://www.microsoft.com/es-es/security/business/siem-and-xdr/microsoft-defender-xdr

    Neo4j (Graph Database)
    https://neo4j.com

    MITRE ATT&CK Framework
    https://attack.mitre.org


    #AI #Automation #Airflow #MachineLearning


About The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI

Welcome to The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI, the podcast where we keep you up to date with insights and ideas propelling the Airflow community forward. Join us each week as we explore the current state, future and potential of Airflow with leading thinkers in the community, and discover how best to leverage this workflow management system to meet the ever-evolving needs of data engineering and AI ecosystems. Podcast webpage: https://www.astronomer.io/podcast/