Python has a rich set of data analytics libraries, which is why I used it for one of my use cases. I first tried the same use case in Java, but it took more effort there, so for the POC I chose Python.
My use case was to read and query the contents of a Parquet file without downloading it to local disk.
To do this, I read the Parquet file into an in-memory stream with the help of BytesIO.
Once I had the stream, I used the pyarrow library to read the data directly into a pandas DataFrame.
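from io import BytesIO
import pandas as pd

def donwloadParquetFile():
    # createBlobServiceClient() and CONTAINER_NAME are defined elsewhere in the project.
    blob_service_client_instance = createBlobServiceClient()
    blob_client_instance = blob_service_client_instance.get_blob_client(
        CONTAINER_NAME, 'userdata1.parquet', snapshot=None)
    # Pull the blob content into memory instead of writing it to a file.
    blob_data = blob_client_instance.download_blob()
    stream = BytesIO()
    blob_data.readinto(stream)
    stream.seek(0)  # rewind so the reader starts at the beginning of the buffer
    processed_df = pd.read_parquet(stream, engine='pyarrow')
    # print(processed_df.columns)
    return processed_df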
Once the data is in the DataFrame, you can query it easily.
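processed_df = donwloadParquetFile()
# Widen the console output before printing (helper sets the pandas display options).
setDisplayForDataFrame(None, None, 1000)
print(processed_df)
print(processed_df.head(3))
print("---------------------Query Output--------------------------------")
# query is the expression string passed in from the main method.
print(processed_df.query(query))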
Here, query is whatever expression you want to run against the DataFrame.
In my case, I passed it in from my main method, in the form:
first_name == "Amanda" and last_name == "Jordan"
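To see how a query string of this form behaves, here is a small sketch on a made-up DataFrame (the rows are invented for illustration; only the column names and the query string come from the example above). It uses the standard pandas DataFrame.query method.

```python
import pandas as pd

# Made-up rows: only the first one satisfies both conditions.
people = pd.DataFrame(
    {
        "first_name": ["Amanda", "Amanda", "Albert"],
        "last_name": ["Jordan", "Smith", "Jordan"],
    }
)

# Same form as the query passed in from the main method.
query = 'first_name == "Amanda" and last_name == "Jordan"'

matches = people.query(query)
print(matches)
```

Both conditions must hold for a row to be kept, so rows that match only the first name or only the last name are filtered out.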
To show the full results in the console, we can set the pandas display options.
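The display helper itself is not shown in this post, so the following is only a guess at what it could look like. The function name set_display_for_dataframe and the argument order (max rows, max columns, width) are my assumptions; pd.set_option and the three option names are standard pandas.

```python
import pandas as pd


def set_display_for_dataframe(max_rows, max_columns, width):
    """Hypothetical helper: apply pandas display options for console output."""
    pd.set_option("display.max_rows", max_rows)        # None means no row limit
    pd.set_option("display.max_columns", max_columns)  # None means no column limit
    pd.set_option("display.width", width)              # line width in characters


# Assumed to mirror the call used earlier: no row or column limit, 1000 characters wide.
set_display_for_dataframe(None, None, 1000)
```

With these options set, printing a DataFrame shows every row and column instead of the truncated view with "..." in the middle.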