Here are 100 data engineering interview questions and answers, covering ETL, data warehousing, data modeling, SQL, big data technologies (Spark, Hadoop, Kafka), cloud data services (AWS, GCP, Azure), orchestration (Airflow), data quality, and architecture patterns. Each question is followed by a detailed answer.
What is data engineering and how does it differ from data science?
Answer: Data engineering focuses on building and maintaining data pipelines, architectures, and infrastructure for collecting, storing, and processing data. Data science uses that data to build models, run analyses, and generate insights. Data engineers ensure data is reliable, accessible, and scalable; data scientists extract value from it.
What is an ETL pipeline? Explain the three phases.
Answer: ETL stands for Extract, Transform, Load. Extract – pulling raw data from source systems (databases, APIs, logs). Transform – cleaning, validating, aggregating, enriching, and restructuring data (e.g., joining tables, handling nulls). Load – writing transformed data into a target system (data warehouse, data lake). Modern ELT (Extract, Load, Transform) moves transformation to the target (e.g., in‑warehouse SQL).
What is the difference between a data lake and a data warehouse?
Answer: A data warehouse stores structured, processed data optimized for analytics and BI (schema‑on‑write). A data lake stores raw, unprocessed data (structured, semi‑structured, unstructured) in its native format (schema‑on‑read). Data lakes are cheaper and more flexible; data warehouses offer better performance and governance for structured queries.
What is a slowly changing dimension (SCD)? Name the common types.
Answer: SCDs track changes in dimension attributes over time. Type 0 (fixed – no changes), Type 1 (overwrite – no history), Type 2 (add new row with versioning/effective dates – full history), Type 3 (add a column – limited history, e.g., previous value), Type 4 (history table separate), Type 6 (hybrid of 1,2,3).
Explain the difference between star schema and snowflake schema.
Answer: Star schema has a central fact table connected directly to dimension tables (denormalized dimensions). Snowflake schema normalizes dimensions into multiple related tables, reducing redundancy but increasing joins. Star schema is simpler and faster for queries; snowflake saves storage but can be slower.
What is a fact table and a dimension table?
Answer: A fact table stores quantitative, measurable data (metrics) about business events (e.g., sales amount, quantity sold). It contains foreign keys to dimension tables. A dimension table stores descriptive attributes (e.g., product name, customer address, date). Facts are additive; dimensions are descriptive.
What is data normalization? When would you denormalize?
Answer: Normalization organizes data to reduce redundancy and improve integrity (e.g., 1NF, 2NF, 3NF). Denormalization intentionally adds redundancy to improve read performance, often used in data warehouses (star schemas). Denormalize when query performance is critical and writes are less frequent.
What are ACID properties? Why are they important?
Answer: Atomicity (all or nothing), Consistency (valid state before and after), Isolation (concurrent transactions don’t interfere), Durability (committed changes persist). ACID ensures reliable database transactions, crucial for operational systems. Data warehouses may relax isolation for performance (eventual consistency).
What is idempotency in data pipelines and why is it important?
Answer: Idempotency means running the same operation multiple times produces the same result as running it once. In data pipelines, it allows safe retries without duplicates or corruption. Achieved by using upsert (merge) logic, deduplication keys, or checkpoints.
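As a rough illustration, upsert (merge) logic is what makes a load step safe to re-run. The sketch below assumes a Spark session configured with the Delta Lake extensions and the delta-spark package; the paths and the order_id key are placeholders.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable  # pip install delta-spark

spark = SparkSession.builder.getOrCreate()

# New batch of records; re-running this job leaves the table in the same final state.
updates = spark.read.parquet("/staging/orders_batch")     # hypothetical path

target = DeltaTable.forPath(spark, "/warehouse/orders")   # hypothetical Delta table

(target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id") # deduplication key
    .whenMatchedUpdateAll()      # update existing rows instead of duplicating them
    .whenNotMatchedInsertAll()   # insert genuinely new rows
    .execute())
```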
Explain the difference between batch processing and stream processing.
Answer: Batch processing processes data in large, bounded chunks at scheduled intervals (e.g., daily ETL). Stream processing processes unbounded, real‑time data records continuously (e.g., Kafka, Flink). Batch is simpler and cost‑effective; stream provides low latency.
What is Apache Spark? How does it differ from Hadoop MapReduce?
Answer: Spark is a unified, in‑memory distributed processing engine for big data. Hadoop MapReduce writes intermediate results to disk, causing higher latency. Spark uses in‑memory caching and optimized execution (DAG) for up to 100x faster performance for iterative and interactive workloads. Supports batch, streaming, SQL, ML, graph.
What are RDDs, DataFrames, and Datasets in Spark?
Answer: RDD (Resilient Distributed Dataset) is the low‑level API – type‑safe but without Catalyst optimization, so typically slower. DataFrame is a distributed collection of rows with named columns, optimized by the Catalyst optimizer and Tungsten execution engine (conceptually similar to a pandas DataFrame). Dataset is a strongly typed, object‑oriented API (Scala/Java). DataFrames are recommended for most use cases.
What is Spark lazy evaluation?
Answer: Transformations in Spark (e.g., map, filter) are not executed immediately – they build a DAG (directed acyclic graph) of operations. Execution only happens when an action (e.g., count, collect, write) is called, allowing optimization like predicate pushdown and pipeline fusion.
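A small PySpark sketch of lazy evaluation; the file path and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/events")           # no job runs yet
filtered = df.filter(F.col("country") == "DE")    # transformation: only extends the plan
projected = filtered.select("user_id", "amount")  # still lazy

# Only this action triggers execution; the filter can be pushed down to the Parquet scan.
print(projected.count())
```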
Explain shuffle in Spark. Why is it expensive?
Answer: Shuffle redistributes data across partitions during operations like groupBy, join, or repartition. It involves disk I/O, network transfer, and serialization, making it expensive. Minimize shuffles by using broadcast joins for small tables, bucketing, or using aggregations before joins.
What is a Spark cluster mode (client vs cluster)?
Answer: Client mode: the driver runs on the machine submitting the job (outside the cluster). Cluster mode: the driver runs inside the cluster (on a worker node). Cluster mode is preferred for production because the driver is not tied to the submitting machine and runs close to the executors, reducing network overhead.
What is Apache Kafka? Explain its core components.
Answer: Kafka is a distributed event streaming platform. Components: Producer (publishes messages), Topic (logical channel), Partition (ordered, immutable sequence of messages), Consumer (reads messages), Broker (Kafka server), Consumer Group (multiple consumers sharing load). Messages are retained for a configurable period.
What is a Kafka offset and how does a consumer commit it?
Answer: Offset is a unique ID per message within a partition. Consumers commit offsets after processing to mark their position. Committing can be automatic (enable.auto.commit) or manual (synchronous/asynchronous) using commitSync/commitAsync. At‑least‑once delivery requires commit after processing.
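A minimal consumer loop with manual commits, sketched with the kafka-python client; the topic, group id, and broker address are placeholders.

```python
from kafka import KafkaConsumer  # pip install kafka-python

def process(payload: bytes) -> None:
    # Placeholder for real processing (parse, validate, write to a sink).
    print(payload)

consumer = KafkaConsumer(
    "orders",                          # topic name is illustrative
    bootstrap_servers="localhost:9092",
    group_id="orders-loader",
    enable_auto_commit=False,          # commit manually for at-least-once delivery
)

for message in consumer:
    process(message.value)
    consumer.commit()                  # synchronous commit after successful processing
```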
What is the difference between at‑most‑once, at‑least‑once, and exactly‑once delivery semantics?
Answer: At‑most‑once: messages may be lost but never duplicated. At‑least‑once: messages are never lost but may be duplicated (default in many systems). Exactly‑once: each message is delivered exactly once (requires idempotent producer and transactional consumer). Kafka supports exactly‑once with idempotence and transactions.
What is a data warehouse and what are its key characteristics?
Answer: A data warehouse is a centralized repository for integrated, historical data optimized for query and analysis. Characteristics: subject‑oriented (organized by business subjects), integrated (consistent naming, formats), non‑volatile (data is not updated, only inserted), time‑variant (tracks changes over time).
What is a data mart?
Answer: A data mart is a subset of a data warehouse focused on a specific business line (e.g., sales, finance). It is smaller and tailored to the needs of one department. Data marts can be built from the data warehouse or sourced directly.
What is an OLAP cube?
Answer: An Online Analytical Processing (OLAP) cube pre‑computes aggregates (sums, averages) across multiple dimensions, allowing fast drill‑down and roll‑up queries. Modern SQL engines (BigQuery, Redshift) often query raw tables with optimizations, reducing need for physical cubes.
What is the difference between row‑oriented and column‑oriented storage?
Answer: Row‑oriented stores entire rows together (best for transactional OLTP). Column‑oriented stores columns separately (best for analytical queries that aggregate or filter on few columns). Columnar storage reduces I/O and improves compression. Examples: Parquet, ORC, BigQuery, Redshift.
What is partitioning in data storage? Why is it used?
Answer: Partitioning splits a table into smaller physical segments based on a column (e.g., date). It improves query performance (partition pruning) and manageability (drop old partitions). In Spark, partitioned writes create directory hierarchies; in BigQuery, partitioned tables limit scan cost.
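For example, a partitioned write in PySpark (the path and partition column are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2024-01-01", 1), ("2024-01-02", 2)], ["event_date", "value"]
)

# One directory per event_date value; readers can prune partitions they don't need.
df.write.mode("overwrite").partitionBy("event_date").parquet("/lake/events")

# This filter only scans the matching event_date directory (partition pruning).
spark.read.parquet("/lake/events").filter("event_date = '2024-01-01'").show()
```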
What is bucketing/clustering?
Answer: Bucketing (Hive/Spark) divides data into a fixed number of files based on hash of a column, improving join performance (bucketed joins avoid shuffle). Clustering (BigQuery/Delta) sorts data within partitions, allowing data skipping. Both optimize certain queries.
What are common file formats? Compare Parquet, Avro, and CSV.
Answer: Parquet (columnar, highly compressed, schema evolution, good for analytics). Avro (row‑based, binary, schema‑embedded, good for streaming and serialization). CSV (plain text, slow, no schema, poor compression). Parquet is preferred for data lakes and analytical workloads.
What is schema evolution? How does it work in Parquet and Avro?
Answer: Schema evolution allows adding, removing, or modifying fields without breaking compatibility. Parquet supports adding nullable fields, but removing or renaming is problematic. Avro supports full schema evolution using writer/reader schemas; field defaults handle missing values.
What is a distributed file system (e.g., HDFS, GCS)?
Answer: A distributed file system stores large files across multiple machines with replication for fault tolerance. HDFS (Hadoop Distributed File System) uses a NameNode plus DataNodes. Cloud object stores (GCS, S3) are not traditional filesystems but are used similarly (objects are immutable – no random writes or appends; files are replaced rather than edited in place).
What is Apache Airflow? What are its components?
Answer: Airflow is a workflow orchestration platform for authoring, scheduling, and monitoring data pipelines. Components: DAG (Directed Acyclic Graph of tasks), Operator (executes a task – PythonOperator, BashOperator), Task, Sensor (waits for a condition), Scheduler (triggers tasks), Web UI, Executor (runs tasks – Local, Celery, Kubernetes).
Explain DAG in Airflow.
Answer: A DAG (Directed Acyclic Graph) is a collection of tasks with defined dependencies and no cycles. Airflow runs DAGs at scheduled intervals. Each DAG has a schedule (cron) and start date. Tasks can have dependencies using >>, <<, or set_upstream/set_downstream.
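A minimal DAG sketch, assuming Airflow 2.4+ (where the schedule argument replaced schedule_interval); the DAG id, schedule, and commands are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform():
    print("transforming...")  # placeholder transformation logic

with DAG(
    dag_id="daily_sales_etl",          # illustrative name
    schedule="0 6 * * *",              # cron: every day at 06:00
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> transform_task >> load  # dependencies declared with >>
```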
What are dynamic tasks in Airflow (task mapping)?
Answer: Task mapping (Airflow 2.3+) allows generating multiple tasks at runtime based on a list, using expand() or partial() + expand(). Reduces code duplication when processing variable‑length inputs.
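A hedged sketch of task mapping with the TaskFlow API (again assuming Airflow 2.4+ for the schedule argument); the file names and task names are illustrative.

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def mapped_load():

    @task
    def list_files() -> list[str]:
        # In practice this might list new objects in a bucket.
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def load_file(path: str) -> None:
        print(f"loading {path}")

    # expand() creates one load_file task instance per element at runtime.
    load_file.expand(path=list_files())

mapped_load()
```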
How do you handle backfilling in Airflow?
Answer: Backfilling runs a DAG for past dates using command line: airflow dags backfill -s start_date -e end_date dag_id. Configure catchup=True (default) to automatically run missed DAG runs. Use max_active_runs to limit concurrency.
What is idempotency in Airflow tasks?
Answer: Idempotency ensures a task can be retried without double‑processing. Achieve by using upsert (merge) logic, writing to a staging table then atomic replace, or using deterministic partition overwrites. Airflow tasks should be designed to be re‑runnable.
What is the difference between an Operator and a Sensor in Airflow?
Answer: Operator executes a single task (e.g., PythonOperator, BashOperator, SQLExecuteQueryOperator). Sensor waits for a condition (e.g., file arrival, external pipeline completion) and succeeds once condition met; it has a poke interval and timeout.
What is XCom in Airflow?
Answer: XCom (cross‑communication) allows tasks to exchange small amounts of data (JSON‑serializable). Use task_instance.xcom_push() and task_instance.xcom_pull(). Not for large data (use external storage like GCS). XComs are stored in Airflow metadata database.
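A small sketch of pushing and pulling an XCom value in the classic PythonOperator style (Airflow 2.x); the task ids and the row-count value are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def push_row_count(ti):
    ti.xcom_push(key="row_count", value=42)  # small, JSON-serializable value

def check_row_count(ti):
    count = ti.xcom_pull(task_ids="push_count", key="row_count")
    print(f"upstream task reported {count} rows")

with DAG("xcom_demo", schedule=None, start_date=datetime(2024, 1, 1), catchup=False) as dag:
    push = PythonOperator(task_id="push_count", python_callable=push_row_count)
    check = PythonOperator(task_id="check_count", python_callable=check_row_count)
    push >> check
```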
What is Apache Beam? How does it relate to Dataflow?
Answer: Apache Beam is a unified programming model for batch and streaming pipelines (Java, Python). It defines a pipeline that runs on different runners. Google Cloud Dataflow is a fully managed Beam runner (auto‑scaling). Other runners: Spark, Flink, DirectRunner.
What is event time vs processing time in streaming?
Answer: Event time is when the event actually occurred (timestamp in data). Processing time is when the system processes the event. Streaming systems must handle out‑of‑order events (watermarks, allowed lateness). Event time is crucial for correct time‑based analytics.
What is a watermark in streaming (e.g., Flink, Beam)?
Answer: A watermark signals to the streaming engine that no more events with timestamps earlier than a given point are expected (or that any such late events will be rare). It allows handling out‑of‑order data: once the watermark passes the end of a window, that window's results can be emitted. Data arriving after the watermark can still be handled with an allowed‑lateness setting.
What is the difference between stateful and stateless stream processing?
Answer: Stateless processing applies transformations that don’t depend on previous events (e.g., map, filter). Stateful processing maintains state across events (e.g., aggregations, joins, session windows). State is checkpointed for fault tolerance.
What is Apache Flink? How does it differ from Spark Streaming?
Answer: Flink is a true streaming engine (event‑at‑a‑time) with low latency, native state management, and exactly‑once semantics. Spark Streaming uses micro‑batches (bounded intervals), which adds latency but simplifies fault tolerance. Flink is better for sub‑second latency; Spark is simpler for batch-integrated workloads.
What is a data pipeline SLA?
Answer: Service Level Agreement (SLA) defines expected pipeline performance: data freshness (latency), availability, reliability (error rate). Example: “95% of events processed within 1 minute, 99.9% uptime”. Monitor with metrics (lag, throughput) and alert on breaches.
What is data lineage and why is it important?
Answer: Data lineage tracks the origin, transformation, and movement of data across systems. It helps with debugging, impact analysis, regulatory compliance (GDPR, CCPA), and data quality. Tools: OpenLineage, Marquez, or custom metadata tracking.
What is a data quality check? Name common types.
Answer: Data quality checks validate: completeness (non‑null), uniqueness (no duplicates), consistency (format, referential integrity), accuracy (range, valid values), timeliness (freshness). Implement as SQL queries in pipeline (dbt tests, Great Expectations, Deequ).
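As a rough sketch, simple checks can also be expressed directly in PySpark before loading; the input path, column names, and thresholds are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/staging/customers")  # hypothetical input

total = df.count()
null_emails = df.filter(F.col("email").isNull()).count()             # completeness
duplicate_ids = total - df.select("customer_id").distinct().count()  # uniqueness
bad_ages = df.filter(~F.col("age").between(0, 120)).count()          # accuracy / range

if null_emails or duplicate_ids or bad_ages:
    raise ValueError(
        f"quality checks failed: {null_emails} null emails, "
        f"{duplicate_ids} duplicate ids, {bad_ages} out-of-range ages"
    )
```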
What is dbt (data build tool)?
Answer: dbt is a transformation tool that enables analysts and engineers to write modular SQL transformations (models) with version control, testing, and documentation. It compiles SQL into tables and views. dbt Cloud or the open‑source dbt Core runs against data warehouses (Snowflake, BigQuery, Redshift).
What is Snowflake (the cloud data warehouse)? Name its key features.
Answer: Snowflake is a cloud data warehouse with separation of storage and compute (virtual warehouses). Features: automatic scaling, zero‑copy cloning, time travel (historical data), semi‑structured data support (VARIANT), and cross‑cloud (AWS, Azure, GCP). Pay per second.
What is a Delta Lake?
Answer: Delta Lake is an open storage layer on top of Parquet that provides ACID transactions, time travel (versioning), schema enforcement, and unified batch/streaming. It works with Spark and other engines and underpins the lakehouse architecture.
What is Apache Iceberg? How does it differ from Delta Lake?
Answer: Iceberg is another open table format for large‑scale data lakes. It supports hidden partitioning, schema evolution, and time travel. Compared with Delta Lake (historically most closely tied to Spark), Iceberg has broad engine support (Trino, Flink, Spark, Hive). Both solve similar problems.
What is a streaming analytics system (e.g., Kafka Streams, KSQL)?
Answer: Kafka Streams is a Java library for building real‑time stream processing applications on top of Kafka (no separate cluster required). ksqlDB (formerly KSQL) provides a SQL interface built on Kafka Streams. They allow transformations, aggregations, and joins on event streams.
What is a Change Data Capture (CDC)? Name common tools.
Answer: CDC captures changes (insert, update, delete) from a source database as they happen. Tools: Debezium (open source, uses Kafka Connectors), AWS DMS, Oracle GoldenGate, Striim. CDC enables near‑real‑time replication to data lakes/warehouses.
What is a data vault (Data Vault 2.0) modeling?
Answer: Data Vault is a modeling methodology designed for agile, scalable data warehouses. It consists of hubs (business keys), links (relationships), and satellites (descriptive attributes). It is auditable, handles change over time, and is independent of source system structure.
What is an operational data store (ODS)?
Answer: An ODS is a database for short‑term, near‑real‑time operational reporting. It stores current data (not historical) and is loaded incrementally. Used for dashboards that need up‑to‑the‑minute data before the data warehouse updates.
What is a Lakehouse architecture?
Answer: Lakehouse combines data lake flexibility (store all data cheaply) with data warehouse features (ACID, schema enforcement, indexing, time travel). Implemented with Delta Lake, Iceberg, or Hudi and queried with engines like Spark, Presto/Trino, or Redshift Spectrum.
What is Apache Hudi?
Answer: Hudi (Hadoop Upserts, Deletes, and Incrementals) provides record‑level upserts and deletes on data lakes, along with incremental queries and time travel. It supports streaming ingestion and transactional writes on the lake.
What is a data catalog? Name examples.
Answer: A data catalog is a metadata management service that indexes and organizes data assets (tables, columns, documentation). Examples: Google Data Catalog, AWS Glue Data Catalog, Apache Atlas, Amundsen, Collibra. It supports data discovery and governance.
What is infrastructure as code (IaC) for data pipelines?
Answer: IaC defines cloud infrastructure (e.g., Cloud Storage buckets, BigQuery datasets, Dataproc clusters) using code (Terraform, CloudFormation, Pulumi). Enables version control, reproducibility, and automated provisioning of data stack.
What is schema‑on‑read vs schema‑on‑write?
Answer: Schema‑on‑write enforces a schema before writing data (data warehouse). Schema‑on‑read applies schema when reading (data lake), allowing raw data storage and flexible interpretation. Each has trade‑offs: performance vs flexibility.
What is predicate pushdown?
Answer: Predicate pushdown pushes filtering (WHERE clauses) down to the storage layer, reducing the data scanned. Supported by columnar formats (Parquet) and query engines (Spark, Trino). Essential for performance and cost (e.g., BigQuery partition pruning).
What is data partitioning in Spark? How does it affect performance?
Answer: Spark splits data into partitions (default 200 for shuffle). Too few partitions under‑utilizes cluster; too many causes scheduler overhead. Use repartition (full shuffle) or coalesce (reduce partitions without shuffle) to tune.
What are broadcast joins in Spark?
Answer: A broadcast join sends the small table (below spark.sql.autoBroadcastJoinThreshold, 10 MB by default) to all executors, avoiding a costly shuffle. Use the broadcast(df) hint or tune the threshold. Efficient for joining fact tables to small dimension tables.
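For instance (paths, table contents, and the join key are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.read.parquet("/lake/sales")     # large fact table
dims = spark.read.parquet("/lake/products")   # small dimension table

# Hint Spark to ship the small table to every executor instead of shuffling both sides.
joined = facts.join(broadcast(dims), on="product_id", how="left")
```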
What is a skewed join?
Answer: Join skew occurs when one key has far more records than others, causing one partition to be overloaded. Solutions: salting (adding a random prefix or suffix to the key), isolating the skewed keys into a separate join, or Spark's Adaptive Query Execution (AQE), which handles skew automatically since 3.0.
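A hedged sketch of salting: the skewed side gets a random salt and the smaller side is replicated across the same salt values; the table paths, the customer_id key, and the salt count are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
NUM_SALTS = 16  # arbitrary; tune to the degree of skew

events = spark.read.parquet("/lake/events")        # skewed on customer_id (hypothetical)
customers = spark.read.parquet("/lake/customers")  # smaller side of the join

# Spread each hot key across NUM_SALTS partitions by adding a random salt.
events_salted = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate every customer row once per salt value so the join still matches.
salts = spark.range(NUM_SALTS).select(F.col("id").cast("int").alias("salt"))
customers_salted = customers.crossJoin(salts)

joined = events_salted.join(customers_salted, on=["customer_id", "salt"])
```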
What is Adaptive Query Execution (AQE) in Spark?
Answer: AQE (Spark 3.0+) dynamically optimizes query plans at runtime based on statistics: coalescing post‑shuffle partitions, switching join strategies (broadcast vs sort‑merge), and handling skew joins. Improves performance without manual tuning.
What is a window function in SQL? Give an example.
Answer: Window functions perform calculations across rows related to current row without collapsing output. Example: ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC). Used for ranking, running totals, moving averages.
How do you deduplicate rows in SQL?
Answer: Use ROW_NUMBER() OVER (PARTITION BY the duplicate key ORDER BY a tie‑breaker such as the latest timestamp), then filter WHERE rn = 1. Alternatives: SELECT DISTINCT (when entire rows are duplicated) or GROUP BY.
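The same pattern in PySpark, keeping the latest record per key (the path and column names are illustrative):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/lake/orders_raw")  # may contain duplicate order_ids (hypothetical)

w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())

deduped = (
    df.withColumn("rn", F.row_number().over(w))
      .filter("rn = 1")   # keep only the most recent version of each order
      .drop("rn")
)
```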
What is the difference between UNION and UNION ALL?
Answer: UNION removes duplicate rows (performs sort/distinct). UNION ALL includes all rows, including duplicates, and is faster. Use UNION ALL unless distinct elimination required.
What is an index? When are they used in analytical workloads?
Answer: Indexes speed up row lookups but add overhead for writes. In analytical databases (warehouses), indexes are less common; clustering and partitioning replace them. Some systems (Redshift sort keys, Snowflake clustering keys) provide similar benefits.
What is materialized view? When would you use it?
Answer: A materialized view stores query results physically, refreshing periodically or incrementally. Use for pre‑computed aggregates or complex joins to speed up repeated queries. BigQuery, Snowflake, and Postgres support them.
What is incremental load vs full load?
Answer: Incremental load transfers only new/changed data since last load (using timestamp, watermark, CDC). Full load loads entire dataset each time. Incremental reduces data volume and pipeline duration, but requires handling updates/deletes.
How do you handle schema changes from source systems in a pipeline?
Answer: Options: ignore new columns (default), evolve schema by adding columns, or fail the pipeline (for breaking changes). Use schema‑on‑read (data lake), schema registry (Avro), or store raw JSON and parse later. Automation: detect changes and alert.
What is a data contract?
Answer: A data contract is an agreement between data producers and consumers specifying schema, quality expectations (nullability, uniqueness), and semantics. Implemented in schema registries, dbt tests, or policy engines. Improves reliability.
What is data observability? Name key pillars.
Answer: Data observability is the ability to understand data health across pipelines. Pillars: freshness (timeliness), distribution (value ranges, anomalies), volume (row counts), schema (drift), and lineage. Tools: Monte Carlo, Bigeye, Soda.
What is data freshness metric?
Answer: Freshness measures how recent the data is compared to its expected arrival time (e.g., last update timestamp). Alert if pipeline lag exceeds a threshold (e.g., 2 hours). Implemented by monitoring the timestamp of the most recent event or load.
What is backpressure in streaming?
Answer: Backpressure occurs when a downstream system (e.g., sink, consumer) cannot keep up with the incoming rate. Streaming engines handle it by buffering, throttling, or dropping data. Kafka consumers can limit max.poll.records or call pause() on partitions to slow intake.
What is a dead letter queue (DLQ) in data pipelines?
Answer: A DLQ stores messages that cannot be processed (e.g., malformed JSON, validation failure) for later inspection. Used to avoid blocking the main pipeline. Kafka, Pub/Sub, and Dataflow support DLQs.
What is a data retention policy?
Answer: Defines how long data is kept in storage (e.g., raw data 30 days, aggregated data 1 year). Implemented via bucket lifecycle rules, table expiration, or scheduled deletions. Compliance, cost, and storage constraints drive policies.
What is data partitioning in BigQuery?
Answer: BigQuery partitioned tables split data by a DATE, TIMESTAMP, or INTEGER column (ingestion time or column). Queries that filter on partition column reduce cost and improve speed. Partition expiration auto‑deletes old partitions.
What is BigQuery clustering?
Answer: Clustering sorts data within each partition based on up to four columns. Benefits: faster queries with filters on those columns, lower cost (reduces bytes scanned). Clustering does not create a separate structure; it organizes data.
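A sketch of creating a partitioned, clustered table through the BigQuery Python client; the dataset, table, and column names are placeholders.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

ddl = """
CREATE TABLE IF NOT EXISTS analytics.events (
  event_ts    TIMESTAMP,
  customer_id STRING,
  amount      NUMERIC
)
PARTITION BY DATE(event_ts)   -- queries filtering on the date of event_ts prune partitions
CLUSTER BY customer_id        -- rows within each partition are organized by customer_id
"""

client.query(ddl).result()  # run the DDL and wait for completion
```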
What is the difference between BigQuery slots and On‑Demand pricing?
Answer: On‑demand: pay per bytes processed by each query (list price has historically been around $5 per TB; check current pricing). Slots: purchase dedicated compute capacity (flat rate), suitable for predictable workloads. Slots can reduce cost and provide consistent performance.
What is a BigQuery BI Engine?
Answer: BI Engine is an in‑memory acceleration service that caches tables or specific columns, providing sub‑second query responses for BI tools. Works with Looker, Tableau, and Looker Studio (formerly Data Studio). Good for dashboards over large datasets.
What is Spark Structured Streaming?
Answer: Spark's stream processing API built on the DataFrame/Dataset API. It uses an incremental execution model (micro‑batch) or continuous processing (low latency) and supports event‑time processing, watermarking, and exactly‑once semantics.
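A minimal sketch that reads from Kafka, counts events per tumbling event-time window with a watermark, and writes with checkpointing; it assumes the spark-sql-kafka connector is on the classpath, and the topic, servers, and paths are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "page_views")          # illustrative topic
    .load()
    .selectExpr("CAST(value AS STRING) AS raw", "timestamp")
)

counts = (
    events
    .withWatermark("timestamp", "10 minutes")    # tolerate 10 minutes of lateness
    .groupBy(F.window("timestamp", "5 minutes")) # tumbling event-time window
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/checkpoints/page_views")  # enables recovery
    .start()
)
query.awaitTermination()
```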
How do you unit test data pipelines?
Answer: Test with sample data subsets: use pytest for Spark (with small DataFrames), Airflow task testing (using DagRun), or dbt unit tests. Use local test containers for databases. Mock external services.
What are integration tests for data pipelines?
Answer: Run pipelines on a staging environment with representative data volume, verifying end‑to‑end across source, transformations, and sink. Use test framework with isolated outputs, compare results to expected.
What is a CI/CD pipeline for data?
Answer: Automates building, testing, and deploying data pipeline code (SQL, Python, DAGs) and infrastructure (IaC). Steps: lint, unit tests, integration tests (on small data), deploy to staging, then production.
What is a data mesh?
Answer: Data mesh is a decentralized architectural paradigm where data is owned by domain teams who expose data as products. Principles: domain ownership, data as product, self‑serve data platform, federated governance.
What is DataOps?
Answer: DataOps applies DevOps practices (CI/CD, monitoring, version control) to data pipelines. Emphasizes collaboration, automation, and reliability. Includes data testing, deployment pipeline, and feedback loops.
What is data catalog search and discovery?
Answer: Ability to find datasets by keywords, tags, or business glossary. Data catalogs provide search, lineage, and ownership metadata. Improves analyst productivity.
What is data profiling?
Answer: Data profiling analyzes source data to understand structure, distribution, nulls, duplicates, and value ranges. Helps create schema, data quality rules, and optimize pipelines. Tools: Great Expectations, Deequ, or custom SQL.
What is Apache Superset?
Answer: Superset is an open‑source BI and data visualization tool that works with many SQL databases (BigQuery, Snowflake, Presto). It provides dashboards, SQL Lab, and chart building. Not a data engineering tool per se, but used for serving data.
What is data virtualization?
Answer: Data virtualization provides a unified query interface over multiple data sources without physically moving the data. Used for real‑time integration, but query performance can suffer at scale. Tools: Denodo, Trino (formerly PrestoSQL), Dremio.
What is cost attribution for data pipelines?
Answer: Tag resources (BigQuery jobs, Dataproc clusters) with labels (team, project, dataset) and use cloud billing exports to allocate costs. Helps optimize spending.
What is the purpose of using coalesce vs repartition in Spark?
Answer: coalesce(n) reduces partitions without shuffle (only merges existing partitions), useful for decreasing partition count after filtering. repartition(n) does full shuffle, evenly redistributing data. Use coalesce for decreasing, repartition for increasing.
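A quick illustration (the dataset and partition counts are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)                 # example dataset
filtered = df.filter("id % 100 = 0")        # much smaller now, same partition count

small = filtered.coalesce(8)                # merge partitions, no shuffle
balanced = filtered.repartition(200, "id")  # full shuffle, redistributes evenly by id

print(small.rdd.getNumPartitions(), balanced.rdd.getNumPartitions())
```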
What is dynamic partition pruning in Spark?
Answer: Dynamic partition pruning (added in Spark 3.0) pushes down partition filters from the dimension table to the fact table at runtime, reducing scan in joins. Improves performance for partitioned tables.
How does Spark handle small file problem?
Answer: Many small files (tiny partitions) cause high metadata and task‑scheduling overhead. Mitigation: coalesce or repartition before writing, run periodic file compaction (merge) jobs, or rely on spark.sql.files.maxPartitionBytes so Spark combines small files into fewer partitions at read time.
What is the difference between spark-submit --master yarn and --master k8s?
Answer: --master yarn runs Spark on the Hadoop YARN cluster manager. --master k8s runs on Kubernetes – native container orchestration with better resource isolation and dynamic allocation. Both are valid; choose based on environment.
What is a Spark accumulator?
Answer: Accumulators are variables that aggregate values from executors to driver (e.g., counts, sums). They are write‑only from executors and only readable by driver. Used for statistics, not for real processing.
What is a broadcast variable in Spark?
Answer: Broadcast variables are read‑only, cached copies of values sent to all executors (e.g., lookup table). More efficient than sending with each task.
What is the purpose of checkpointing in Spark Structured Streaming?
Answer: Checkpointing saves the state of streaming query (offsets, progress, aggregation state) to reliable storage (HDFS, GCS). Allows recovery after failures and restarts. Required for exactly‑once.
What is output mode in Structured Streaming?
Answer: Append (only new, finalized rows), Update (only rows changed since the last trigger), Complete (the full result table every trigger). Append is the default; aggregations in Append mode require a watermark, while Complete re‑emits all aggregates each trigger.
What is a sliding window vs tumbling window in streaming?
Answer: Tumbling windows are fixed, non‑overlapping intervals (e.g., 5 minutes). Sliding windows overlap (e.g., window length 10 minutes, slide 5 minutes). Used for moving averages.
What is a session window in streaming?
Answer: Session windows group events separated by a gap of inactivity (e.g., 5 minutes). Useful for user activity sessions. Supported in Beam, Flink, and Spark Structured Streaming (session_window in Spark 3.2+, or mapGroupsWithState for custom session logic).
Why should we hire you as a data engineer?
Answer: I have strong skills in building reliable, scalable data pipelines using Spark, Airflow, and SQL. I understand data modeling (star schema, SCD) and cloud data platforms (GCP/AWS). I prioritize data quality, observability, and cost efficiency. I can communicate with data scientists and business stakeholders, and I constantly look for ways to automate and improve pipelines. I also keep up with evolving technologies (Delta Lake, Iceberg, dbt).
Conclusion
You’ve made it to the end — and take a moment, because that alone says something remarkable about your dedication. You didn’t just skim through buzzwords. You dug into data pipelines, ETL processes, data modeling, Spark, SQL, cloud platforms, and all the invisible engineering that keeps the data world spinning. That’s no small feat. That’s the work of a true data engineer in the making.
The questions you’ve studied here aren’t just interview material — they’re the exact problems you’ll solve on the job, building systems that turn raw chaos into clean, reliable data. And now, you’re equipped not only to answer confidently but to think like an engineer who designs for scale, resilience, and clarity.
Walk into that interview with your head up. You’ve put in the reps, connected the dots, and built a foundation that can hold anything they throw at you. The data world is hungry for people like you — prepared, curious, and deeply capable. You’re ready to feed that hunger and build something great. Now go claim the role that’s waiting for you.