
PySpark Read and Write Operations AWS S3 Bucket

Updated: Sep 27, 2022






What is PySpark?


PySpark is the Python API for Apache Spark, an open-source distributed computing framework and set of libraries for large-scale, real-time data processing. If you are already familiar with Python and libraries like Pandas, PySpark is a natural next step for building more scalable analytics and pipelines.


Apache Spark is essentially a computational engine that works with large datasets by processing them in parallel and in batches. Spark itself is written in Scala, and PySpark was released so that Spark can be used from Python. In addition to providing an API for Spark, PySpark lets you interact with Resilient Distributed Datasets (RDDs) by leveraging the Py4j library.
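As a quick taste of the API, here is a minimal local sketch (not part of the S3 walkthrough that follows; the sample rows are invented purely for illustration):

from pyspark.sql import SparkSession

# Start a local Spark session and build a tiny DataFrame
spark = SparkSession.builder.master('local[*]').appName('pyspark_intro').getOrCreate()
df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ['id', 'name'])
df.show()

# The underlying RDD is still accessible (bridged to the JVM via Py4j)
print(df.rdd.map(lambda row: row.name).collect())

spark.stop()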



The purpose of this article is to understand basic read and write operations on Amazon Simple Storage Service (S3). More specifically, to perform read and write operations on AWS S3 using PySpark, the Apache Spark Python API.

Apache Spark + AWS S3 Bucket



Operation and Development: PySpark + AWS S3 Bucket



1. Configuring the Spark Session on a Spark Standalone Cluster


# findspark locates the local Spark installation before pyspark is imported
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

# Pull in the AWS SDK and the hadoop-aws connector when the PySpark shell starts
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

Set the SparkConf properties and create the SparkContext:



# Spark configuration: enable AWS Signature Version 4 on both driver and executors
conf = (
    SparkConf()
    .set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
    .set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
    .setAppName('pyspark_aws')
    .setMaster('local[*]')
)

sc = SparkContext(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')

print('modules imported')

Set Spark Hadoop properties for all worker nodes as below:



# Placeholder credentials; never hard-code real keys in source code
accessKeyId = 'xxxxxxxxxxxx'
secretAccessKey = 'xxxxxxxxxxxxxxx'

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', accessKeyId)
hadoopConf.set('fs.s3a.secret.key', secretAccessKey)
hadoopConf.set('fs.s3a.endpoint', 's3-us-east-2.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')

spark = SparkSession(sc)

Two notes on the configuration above:


s3a for reads and writes: there are three URI schemes for accessing S3 from Hadoop: s3, s3n and s3a. This post uses only s3a, as it is the fastest and the actively maintained connector. Please note that the older s3 scheme is not available in newer releases.


v4 authentication: AWS S3 supports two signature versions, v2 and v4. Newer regions such as us-east-2, used here, accept only v4, which is why it is enabled in the configuration above. For more details, see Authenticating Requests (AWS Signature Version 4) in the Amazon Simple Storage Service documentation.
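For reference, the same setup can also be expressed more compactly with SparkSession.builder; this is only a sketch, with the connector versions, endpoint and bucket region following the article and the access keys as placeholders:

from pyspark.sql import SparkSession

# Alternative, builder-based setup (sketch); spark.hadoop.* entries map to Hadoop configuration
spark = (
    SparkSession.builder
    .appName('pyspark_aws')
    .master('local[*]')
    .config('spark.jars.packages',
            'com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3')
    .config('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
    .config('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
    .config('spark.hadoop.fs.s3a.access.key', 'xxxxxxxxxxxx')
    .config('spark.hadoop.fs.s3a.secret.key', 'xxxxxxxxxxxxxxx')
    .config('spark.hadoop.fs.s3a.endpoint', 's3-us-east-2.amazonaws.com')
    .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
    .getOrCreate()
)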



2. Read the dataset from the local file system


# Read the sample dataset from the local file system (raw string for the Windows path)
emp_df = spark.read.csv(r'D:\python_coding\GitLearn\python_ETL\emp.dat',
                        header=True, inferSchema=True)
emp_df.show(5)


3. Write the PySpark DataFrame to AWS S3 Storage



# Write the DataFrame to the S3 bucket as CSV, using the s3a scheme
emp_df.write.format('csv').option('header', 'true') \
    .save('s3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv', mode='overwrite')
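Note that Spark writes emp.csv as a directory of part files rather than a single CSV. If one file is preferred (only sensible for small datasets), a coalesce(1) before the write does the trick; a sketch using the same path:

# Collapse to a single partition so only one CSV part file is written (small data only)
(emp_df.coalesce(1)
    .write.format('csv')
    .option('header', 'true')
    .mode('overwrite')
    .save('s3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv'))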


Verify the dataset in the S3 bucket as below:
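Besides checking the AWS console, the written objects can also be listed from Python; a sketch assuming the boto3 package is installed and reusing the credentials defined above:

# List the objects written under the prefix (requires the boto3 package)
import boto3

s3 = boto3.client(
    's3',
    aws_access_key_id=accessKeyId,
    aws_secret_access_key=secretAccessKey,
    region_name='us-east-2',
)
resp = s3.list_objects_v2(Bucket='pysparkcsvs3', Prefix='pysparks3/emp_csv/emp.csv')
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])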



We have successfully written the Spark DataFrame to the AWS S3 bucket “pysparkcsvs3”.



4. Read the data from AWS S3 into a PySpark DataFrame


# Read the CSV data back from S3 into a DataFrame
s3_df = spark.read.csv('s3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv/',
                       header=True, inferSchema=True)
s3_df.show(5)
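A couple of quick sanity checks on the DataFrame read back from S3:

s3_df.printSchema()    # confirm the inferred column types
print(s3_df.count())   # row count should match the local emp.dat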


We have successfully written data to and read it back from AWS S3 storage with the help of PySpark.



5. Done!

Performed the PySpark AWS S3 read and write operations. Mission complete!



__________________________________________________________________________________


Use Case Conclusion


The 2019 Global Data Quality Survey found that, although 84% of companies still have their data volumes managed solely or primarily by IT, that department accounts for only 53% of new data-driven projects. In this scenario, initiatives are already being run jointly by several areas working together (28%) and by teams led by Chief Data Officers (24%).


  • 75% agree that responsibility for data should be shared across several departments, with occasional help from IT – in Brazil, this perception reaches 68%;

  • 13% said they had already adopted decentralized information management – in Brazil, the indicator drops to 6%;

  • 56% perceive that IT does not fully understand users' data management needs – and 57% of those answers correspond to the opinion of professionals who work in that same area.


“We see more organizations starting to establish sound ownership and stronger data leadership, led by a Chief Data Officer (CDO). This is a fundamental guideline for implementing strategies capable of strengthening compliance and information security, as well as ensuring that the right people have access to reliable data, to take optimal advantage of insights, which can provide more informed decisions and, consequently, better results for business,” said Junqueira.


__________________________________________________________________________________


Notes and Related


> Initially, our purpose with the Blog is to make different kinds of posts, but we will have a separate area for posts related to Constructor SO, which is our Project Portfolio (Agile and Scrum), in which each member of Constructor SO has their own area for their developments. In this way, each update in the Constructor SO area is followed by a post on the professional's blog, informing our readers and thus building an extensive overview of the released or versioned work;


> Regarding the developments of Space_One Labs, our idea is to launch and work on several projects in the related area at random, so as not to become limited to specific apps or software;


> All the cases described and developed here on this blog under the "BI Case" category involve fictitious companies, created and invented to provide context and make the work as lively and realistic as possible.



__________________________________________________________________________________


Daniel Sanches


Production Engineer, Universo - Universidade Salgado de Oliveira, specialized in Analytics and Data, Business and Technology


SO Labs Developer Member of Research, Business Intelligence Analyst, Data Analyst and Business Analyst, Data Engineer, Blogger DS Space_One Labs | Space Members


Member SO Labs Research, Business Intelligence, Data Engineer, Data Analyst and Business Developer



