Encryption of PII or Sensitive Data in Azure Databricks

Using Fernet symmetric-key encryption

Prerequisites

  • Azure Account
  • Azure Storage
  • Azure Databricks

Use Case

  • Encrypt PII or other sensitive data
  • Data should be stored encrypted
  • Only people who have access to the key can decrypt it
  • Encrypt at the column level, so only the necessary columns are encrypted and the others remain available for reporting
  • The open source encryption project used here is the Python cryptography library (its Fernet module)
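To make the "only people with the key can decrypt" point concrete, here is a minimal sketch in plain Python. The names owner_key, other_key and try_decrypt are made up for illustration; only Fernet and InvalidToken come from the cryptography library.

```python
from cryptography.fernet import Fernet, InvalidToken

# Two independent keys: one held by the data owner, one by someone without access.
owner_key = Fernet.generate_key()
other_key = Fernet.generate_key()

# Encrypt a sample value with the owner's key.
sample_token = Fernet(owner_key).encrypt(b"6789454")


def try_decrypt(candidate_key, token):
    """Return the plaintext bytes, or None if this key cannot decrypt the token."""
    try:
        return Fernet(candidate_key).decrypt(token)
    except InvalidToken:
        return None


print(try_decrypt(owner_key, sample_token))  # b'6789454'
print(try_decrypt(other_key, sample_token))  # None
```

A token is authenticated as well as encrypted, so a wrong key fails cleanly with InvalidToken instead of returning garbage.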

Code

  • First, create a Databricks cluster
  • Install the cryptography library on it:
  • After the cluster starts, go to its Libraries tab
  • Select PyPI, type cryptography, and install
  • Wait for the package to finish installing
  • Create a new notebook
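Before going further, a quick check in a Python cell can confirm the library is actually visible to the notebook. This is only a sketch; installed_version is a hypothetical helper built on the standard importlib.metadata module.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the cluster library installed correctly, else None.
print("cryptography:", installed_version("cryptography"))
```

If this prints None, the library is probably still installing or was attached to a different cluster.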
%sql
use default; -- Change this value to some other database if you do not want to use the Databricks default
drop table if exists Test_Encryption;
create table Test_Encryption(Name string, Address string, ssn string) USING DELTA;

%sql
insert into Test_Encryption values ('Mark Smith', 'my street, universe', '6789454');
insert into Test_Encryption values ('King Solomon', 'somewhere on earth, Arctic', '98023456');
  • Sample code to test the encryption library
from cryptography.fernet import Fernet
# >>> Put this somewhere safe!
key = Fernet.generate_key()
f = Fernet(key)
token = f.encrypt(b"A really secret message. Not for prying eyes.")
print(token)
print(f.decrypt(token))
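A Fernet key is 32 random bytes encoded as URL-safe base64, which is why the printed key is 44 characters long. The sketch below uses only the standard library to show that format and to sanity-check a value before you store it somewhere safe. Both function names are hypothetical, and this is an illustration of the key format, not a replacement for Fernet.generate_key().

```python
import base64
import os


def new_fernet_style_key():
    """32 random bytes, URL-safe base64 encoded (the Fernet key format)."""
    return base64.urlsafe_b64encode(os.urandom(32))


def looks_like_fernet_key(value):
    """True if value decodes from URL-safe base64 to exactly 32 bytes."""
    try:
        return len(base64.urlsafe_b64decode(value)) == 32
    except (ValueError, TypeError):
        return False


key_candidate = new_fernet_style_key()
print(len(key_candidate))                    # 44
print(looks_like_fernet_key(key_candidate))  # True
print(looks_like_fernet_key(b"too-short"))   # False
```

A check like this catches a truncated or mistyped key early, before it ends up in a secret store and every decrypt starts failing.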
  • Now create UDFs to encrypt and decrypt
# Define encrypt user defined function
def encrypt_val(clear_text, MASTER_KEY):
    from cryptography.fernet import Fernet
    f = Fernet(MASTER_KEY)
    clear_text_b = bytes(clear_text, 'utf-8')
    cipher_text = f.encrypt(clear_text_b)
    cipher_text = str(cipher_text.decode('ascii'))
    return cipher_text

# Define decrypt user defined function
def decrypt_val(cipher_text, MASTER_KEY):
    from cryptography.fernet import Fernet
    f = Fernet(MASTER_KEY)
    clear_val = f.decrypt(cipher_text.encode()).decode()
    return clear_val
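Real tables usually contain NULLs, and the functions above would raise on a None input. Here is a hedged variant that passes None through and can be tried in plain Python before wiring it into Spark. The names encrypt_val_safe, decrypt_val_safe and test_key are made up for this sketch.

```python
from cryptography.fernet import Fernet


def encrypt_val_safe(clear_text, master_key):
    """Encrypt a string with Fernet; a None input is returned as None."""
    if clear_text is None:
        return None
    return Fernet(master_key).encrypt(clear_text.encode("utf-8")).decode("ascii")


def decrypt_val_safe(cipher_text, master_key):
    """Decrypt a Fernet token string; a None input is returned as None."""
    if cipher_text is None:
        return None
    return Fernet(master_key).decrypt(cipher_text.encode("ascii")).decode("utf-8")


# Round trip outside Spark with a throwaway key.
test_key = Fernet.generate_key()
sample_cipher = encrypt_val_safe("6789454", test_key)
print(decrypt_val_safe(sample_cipher, test_key))  # 6789454
print(encrypt_val_safe(None, test_key))           # None
```

Testing the functions this way first makes UDF failures much easier to debug, since errors inside a Spark UDF surface as long executor stack traces.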
  • Now let's encrypt a column
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import StringType
# Register the UDFs
encrypt = udf(encrypt_val, StringType())
decrypt = udf(decrypt_val, StringType())
# Fetch the key from a secret scope (preferred outside of a quick test)
# encryptionKey = dbutils.secrets.get(scope = "encrypt", key = "fernetkey")
# For this test, reuse the key generated in the sample code
encryptionKey = key
# Encrypt the data
df = spark.table("Test_Encryption")
encrypted = df.withColumn("ssn", encrypt("ssn", lit(encryptionKey)))
display(encrypted)
# Save the encrypted data
encrypted.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("Test_Encryption_Table")
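The commented-out secret lookup in the cell above is where a real key should come from. Below is a small sketch of a wrapper that fetches the key and checks its shape before use. load_fernet_key and fake_get_secret are hypothetical names, and the scope "encrypt" and key "fernetkey" are simply the ones used in the comment above; in a notebook you would pass dbutils.secrets.get as the getter.

```python
import base64


def load_fernet_key(get_secret, scope, name):
    """Fetch a key via get_secret(scope=..., key=...) and check it is Fernet-shaped."""
    raw = get_secret(scope=scope, key=name)
    key_bytes = raw.encode("ascii") if isinstance(raw, str) else bytes(raw)
    if len(base64.urlsafe_b64decode(key_bytes)) != 32:
        raise ValueError("secret is not a 32-byte URL-safe base64 key")
    return key_bytes


# In a Databricks notebook (assumes the scope and key already exist):
# encryptionKey = load_fernet_key(dbutils.secrets.get, "encrypt", "fernetkey")


# Stand-in for a secret store so the sketch runs anywhere.
def fake_get_secret(scope, key):
    return base64.urlsafe_b64encode(bytes(32)).decode("ascii")


loaded_key = load_fernet_key(fake_get_secret, "encrypt", "fernetkey")
print(len(loaded_key))  # 44
```

Passing the getter in as a parameter keeps the function testable outside Databricks, where dbutils is not defined.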
  • Now decrypt the column
decrypted = encrypted.withColumn("ssn", decrypt("ssn",lit(encryptionKey)))
display(decrypted)
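One property to keep in mind when reporting on the encrypted table: Fernet puts a random IV and a timestamp into every token, so the same value encrypts to a different token each time. The short sketch below shows this in plain Python; demo_key and cipher are names made up for the example.

```python
from cryptography.fernet import Fernet

demo_key = Fernet.generate_key()
cipher = Fernet(demo_key)

# Encrypt the same value twice with the same key.
first = cipher.encrypt(b"6789454")
second = cipher.encrypt(b"6789454")

print(first == second)                                  # False
print(cipher.decrypt(first) == cipher.decrypt(second))  # True
```

In practice this means the encrypted ssn column cannot be used for equality joins or group-by; decrypt first, or keep a separate non-sensitive key column for that purpose.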