Azure -> Parquet File -> Python -> Read and query a Parquet file without downloading it
Python has a rich set of data analytics libraries, which is why I picked it for one of my use cases. I first tried the same use case in Java, but it took more effort there, so for the POC I chose Python.
My use case was to read and query the content of a Parquet file without saving it as a local file first.
To do this, I read the Parquet blob into an in-memory stream with the help of BytesIO.
Once I had the stream, I used the pyarrow library, which let me read the data directly into a pandas DataFrame.
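from io import BytesIO
import pandas as pd

def donwloadParquetFile():
    # createBlobServiceClient() and CONTAINER_NAME are not shown here.
    blob_service_client_instance = createBlobServiceClient()
    blob_client_instance = blob_service_client_instance.get_blob_client(
        CONTAINER_NAME, 'userdata1.parquet', snapshot=None)
    # Stream the blob content into memory instead of writing it to a file.
    blob_data = blob_client_instance.download_blob()
    stream = BytesIO()
    blob_data.readinto(stream)
    stream.seek(0)  # rewind the buffer before handing it to the reader
    processed_df = pd.read_parquet(stream, engine='pyarrow')
    # print(processed_df.columns)
    return processed_df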
Once the data is in the DataFrame, you can query it easily.
processed_df = donwloadParquetFile()
setDisplayForDataFrame(None, None, 1000)
print(processed_df)
print(processed_df.head(3))
print("---------------------Query Output--------------------------------")
print(processed_df.query(query))
Here, query is your own query string.
In my case, I passed it in from my main method, in the following form:
first_name == "Amanda" and last_name == "Jordan"
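To see how such a query string is evaluated, here is a small sketch on made-up rows (the data is an assumption; only the column names and the query come from the example above). It also shows the equivalent boolean mask for comparison.

```python
import pandas as pd

# Made-up rows; the column names follow the query shown above.
people_df = pd.DataFrame(
    {
        "first_name": ["Amanda", "Amanda", "Albert"],
        "last_name": ["Jordan", "Gray", "Jordan"],
    }
)

query = 'first_name == "Amanda" and last_name == "Jordan"'

# DataFrame.query evaluates the expression against the column names
# and returns only the rows where it is true.
matches = people_df.query(query)
print(matches)

# The same filter written as a boolean mask, for comparison.
mask = (people_df["first_name"] == "Amanda") & (people_df["last_name"] == "Jordan")
masked = people_df[mask]
```

Passing the condition as a string is convenient when it arrives from outside the function, for example from a main method or the command line.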
To control how the results are shown in the console, we can set the pandas display options.
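One possible shape for the setDisplayForDataFrame helper called earlier is sketched below. This is an assumption, not the original implementation: the parameter names and their order (max rows, max columns, width) are guessed from the call setDisplayForDataFrame(None, None, 1000).

```python
import pandas as pd


def setDisplayForDataFrame(max_rows, max_columns, width):
    """Hypothetical version of the helper; the parameter order is a guess."""
    pd.set_option("display.max_rows", max_rows)        # None means no row limit
    pd.set_option("display.max_columns", max_columns)  # None means no column limit
    pd.set_option("display.width", width)              # console width in characters


setDisplayForDataFrame(None, None, 1000)
```

With both limits set to None, pandas prints every row and column instead of truncating the output, which can be slow for large files.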