Databricks provides powerful capabilities for working with Delta tables, enabling users to manage data efficiently. One of the most exciting features is Time Travel, which allows you to access previous versions of a table, track changes, and recover lost data. In this blog, we’ll walk through a hands-on demonstration of Delta Table Time Travel in Databricks.
Understanding Delta Tables
Delta tables enable structured data management in Apache Spark with ACID transactions, schema enforcement, and built-in versioning. Using Spark SQL, you can perform transformations such as filtering, grouping, and ordering before saving the results as a Delta table. Once saved, these tables are accessible across applications and can be queried using Databricks SQL.
Setting Up a Delta Table in Databricks
To begin, we use a Databricks notebook to create and interact with a Delta table:
Open your Python notebook under the Workspace section.
Attach the notebook to a running cluster.
Create a DataFrame and store it in a Delta table using the write command.
ordered_df.write.format("delta").mode("overwrite").saveAsTable("Clarity_Effect_Skills")
This command saves the data in Delta format and registers it as a table that can be queried via Spark SQL.
Querying the Delta Table
Once the table is created, you can query it using Spark SQL:
SELECT * FROM Clarity_Effect_Skills;
Alternatively, you can use Python:
df = spark.sql("SELECT * FROM Clarity_Effect_Skills")
df.show()
Databricks also allows querying through a SQL notebook for a seamless experience.
Tracking Changes with Describe History
Delta tables maintain an audit trail of changes. You can retrieve this history using the following SQL command:
DESCRIBE HISTORY Clarity_Effect_Skills;
This command displays every operation performed on the table, such as inserts, updates, and deletes, along with timestamps and user details. It’s a useful feature for auditing and debugging.
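The history output can also be inspected programmatically. As an illustrative sketch (plain Python, not a Databricks API), the rows returned by DESCRIBE HISTORY can be treated as records with version and timestamp fields to work out which version was current at a given moment. The version_as_of helper and the sample history below are hypothetical:

```python
from datetime import datetime

def version_as_of(history, ts):
    """Return the latest table version committed at or before ts.

    `history` mimics DESCRIBE HISTORY output: dicts with
    'version' (int) and 'timestamp' (datetime) keys.
    """
    eligible = [row["version"] for row in history if row["timestamp"] <= ts]
    if not eligible:
        raise ValueError("no version exists at or before the given timestamp")
    return max(eligible)

# Hypothetical history: version 0 (table created), version 1 (the UPDATE below)
history = [
    {"version": 0, "timestamp": datetime(2024, 3, 1, 10, 0)},
    {"version": 1, "timestamp": datetime(2024, 3, 1, 12, 30)},
]
print(version_as_of(history, datetime(2024, 3, 1, 11, 0)))  # version 0 was current then
```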
Performing Updates and Inserts
Let’s modify the table by updating a record:
UPDATE Clarity_Effect_Skills SET clarity = 'pH Scales' WHERE clarity = 'C2';
After performing this update, running DESCRIBE HISTORY again will show a new version of the table.
Delta Table Storage and Versioning
Each table version is backed by Parquet data files, together with a _delta_log directory of commit files that records every change, stored under the Hive warehouse directory in Databricks. You can list the files using the following command:
%fs ls dbfs:/user/hive/warehouse/clarity_effect_skills
Each write operation commits a new version to the transaction log, which is what enables Time Travel functionality.
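Note that saveAsTable("Clarity_Effect_Skills") lands under a lowercased directory name, which is why the listing above shows clarity_effect_skills. A small sketch of that mapping, assuming the default managed-table location (the warehouse_path helper is illustrative, not a Databricks function):

```python
def warehouse_path(table_name, warehouse="dbfs:/user/hive/warehouse"):
    """Default managed-table location: the Hive metastore lowercases table names."""
    return f"{warehouse}/{table_name.lower()}"

print(warehouse_path("Clarity_Effect_Skills"))
# dbfs:/user/hive/warehouse/clarity_effect_skills
```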
Time Travel: Querying Previous Versions
With Time Travel, you can view previous versions of a table by specifying a timestamp or version number.
Query by Timestamp
To query a past version based on a timestamp:
SELECT * FROM Clarity_Effect_Skills TIMESTAMP AS OF '2024-03-01T12:00:00Z';
Query by Version Number
To retrieve data from a specific version:
SELECT * FROM Clarity_Effect_Skills VERSION AS OF 1;
By accessing older versions, users can recover lost records, analyze historical trends, and validate data changes.
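Both query forms follow the same pattern, so they are easy to generate programmatically. Below is a minimal, hypothetical helper that builds the time-travel SQL string; the table name comes from the examples above, but the function itself is not part of any Databricks API:

```python
from typing import Optional

def time_travel_query(table: str, version: Optional[int] = None,
                      timestamp: Optional[str] = None) -> str:
    """Build a Delta time-travel SELECT for exactly one of version/timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("specify exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {version}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"

print(time_travel_query("Clarity_Effect_Skills", version=1))
# SELECT * FROM Clarity_Effect_Skills VERSION AS OF 1
```

In a notebook, the resulting string could then be passed to spark.sql() just like the earlier queries.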
Use Cases of Time Travel
Accidental Data Deletion Recovery – Restore data lost due to unintended modifications.
Historical Data Analysis – Compare different table versions to analyze trends.
Auditing and Compliance – Track who made changes and when they occurred.
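For the recovery use case in particular, Delta Lake also provides a RESTORE command that rolls a table back to an earlier state without manually copying data. A sketch using the example table, assuming version 1 is the state you want back:

```sql
RESTORE TABLE Clarity_Effect_Skills TO VERSION AS OF 1;
```

The restore itself is recorded in the table history as a new version, so it can be audited like any other operation.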
Conclusion
Delta Table Time Travel is a powerful feature that enhances data reliability, making it easy to track changes and recover previous data states. With Databricks, users can leverage SQL and Spark APIs to interact with Delta tables efficiently. Whether you are debugging, auditing, or restoring lost records, Time Travel provides a seamless solution for managing historical data.