Amazon Athena now supports the open source Apache Spark distributed processing engine for running fast analytics workloads.
Data analysts and engineers can use Jupyter Notebook in Athena to perform data processing and interact programmatically with Spark applications.
Industries such as financial services, healthcare, and retail increasingly need to run sophisticated analytics on data of varying formats and sizes. To support these complex workloads, many organizations have adopted Apache Spark, a popular open source distributed processing system designed to run fast analytics workloads on data of all sizes.
However, building an infrastructure to run Apache Spark for interactive applications is not easy. Customers must provision, configure, and maintain the infrastructure in addition to the applications themselves. They must also tune resources carefully so that applications do not start slowly or incur idle costs.
How it works
Because Amazon Athena for Apache Spark is serverless, customers can explore data interactively and gain insights without provisioning or maintaining resources to run Apache Spark. With this feature, customers can now build Apache Spark applications in a notebook, either directly from the Athena console or programmatically using the API.
Amazon Athena integrates with the AWS Glue Data Catalog, helping customers work with any data source in the AWS Glue Data Catalog, including data in Amazon S3. This opens up opportunities for customers to build applications for data analysis and visualization, data exploration, and preparing datasets for machine learning pipelines.
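To make the Data Catalog integration concrete, here is a minimal sketch of a notebook cell that previews a catalog table through Spark SQL. It assumes the notebook session exposes a SparkSession named spark. The helper name preview_catalog_table and the database and table names in the usage note are hypothetical.

```python
def preview_catalog_table(spark, database, table, limit=10):
    """Return a DataFrame with the first rows of a Glue Data Catalog table.

    `spark` is assumed to be the SparkSession available in the notebook.
    """
    # Backticks quote the identifiers so names with unusual characters still parse.
    query = f"SELECT * FROM `{database}`.`{table}` LIMIT {int(limit)}"
    return spark.sql(query)
```

In a notebook cell, a call such as preview_catalog_table(spark, "sales_db", "orders").show() would print the preview, provided a table with that (hypothetical) name exists in the catalog.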
API programmatic access
In addition to using the Athena console, I can also use programmatic access to interact with the Spark application in Athena. For example, I can create a workgroup with the create-work-group command, start a notebook with create-notebook, and start a notebook session with start-session.
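The same three steps can be sketched with the AWS SDK for Python (boto3) rather than the command line. This is an illustration under stated assumptions, not a definitive setup: the helper name bootstrap_spark_notebook is hypothetical, the engine version string and the DPU limit are assumed values, and the role ARN and S3 output location must be supplied by the caller.

```python
def bootstrap_spark_notebook(athena, workgroup, notebook_name, role_arn, output_location):
    """Create a Spark workgroup, a notebook, and a session.

    `athena` is expected to behave like boto3.client("athena").
    Returns (notebook_id, session_id).
    """
    # Equivalent of create-work-group: a workgroup configured for the Spark engine.
    athena.create_work_group(
        Name=workgroup,
        Configuration={
            # Assumed engine version string; check the value offered in your account.
            "EngineVersion": {"SelectedEngineVersion": "PySpark engine version 3"},
            "ExecutionRole": role_arn,
            "ResultConfiguration": {"OutputLocation": output_location},
        },
    )
    # Equivalent of create-notebook.
    notebook = athena.create_notebook(WorkGroup=workgroup, Name=notebook_name)
    # Equivalent of start-session; the DPU cap of 20 is an assumed example value.
    session = athena.start_session(
        WorkGroup=workgroup,
        EngineConfiguration={"MaxConcurrentDpus": 20},
    )
    return notebook["NotebookId"], session["SessionId"]
```

To try it, create a client with boto3.client("athena") and pass it in along with your own workgroup name, notebook name, execution role ARN, and S3 results location. Passing the client as an argument keeps the helper easy to exercise against a stub.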
Using programmatic access is useful when I need to execute commands, such as creating reports or calculating data, without having to open the Jupyter notebook.
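As a rough sketch of that workflow with boto3, the helper below submits a block of PySpark code to an existing session and polls until the calculation finishes. The helper name run_calculation, the polling interval, and the retry limit are illustrative assumptions. The session ID is assumed to come from an earlier start-session call, and the session is assumed to be ready to accept work.

```python
import time

# States after which a calculation will not change any further.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def run_calculation(athena, session_id, code, poll_seconds=2.0, max_polls=150):
    """Submit `code` to a running session and wait for a terminal state.

    `athena` is expected to behave like boto3.client("athena").
    Returns (calculation_execution_id, final_state).
    """
    started = athena.start_calculation_execution(SessionId=session_id, CodeBlock=code)
    calc_id = started["CalculationExecutionId"]
    for _ in range(max_polls):
        status = athena.get_calculation_execution(CalculationExecutionId=calc_id)
        state = status["Status"]["State"]
        if state in TERMINAL_STATES:
            return calc_id, state
        time.sleep(poll_seconds)
    raise TimeoutError(f"calculation {calc_id} did not finish after {max_polls} polls")
```

A scheduled job could call run_calculation with a short PySpark snippet that builds a report, then inspect the returned state to decide whether to retry or raise an alert.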