Transforming large amounts of data into formats that help solve business problems is what data engineers excel at. A combination of Serverless tools such as Athena, Step Functions, Lambda, or Glue can get the job done in many projects. However, some customers prefer to rely on Open-Source projects and tools to access more talent in the job market and be less reliant on a single cloud provider.

Today, we'll share a story of a modernization project that we did for a customer in the online marketing industry. The customer was looking to modernize their existing data analytics pipeline because they had some pain points they wanted to resolve. The primary purpose of the solution is to process event data and enable timely reporting.

They had an EMR cluster that was provisioned through a script and then manually customized, which resulted in stability and reproducibility issues. Not all best practices regarding access control and isolation of production environments were followed, because the system had grown organically over the years without clear ownership of the whole architecture. Moreover, a license for an Oracle database in their On-Prem environment was expiring, and they didn't want to renew it. Modernizing the ETL jobs that rely on that database was within this project's scope. They were using an On-Prem installation of Apache Airflow to orchestrate these jobs, and they considered replacing it with a managed service because having a strong dependency on an On-Prem resource was risky.

Let's look at the architecture we came up with to address the customer's needs. We began by setting up a Landing Zone with a multi-account structure and centralized access management based on AWS SSO. We created three different environments: SANDBOX, TEST, and PRODUCTION. Each environment has its own VPC, a requirement for using Managed Workflows for Apache Airflow (MWAA), as it doesn't support deployments in a shared-subnet setup. SSO was used to create different permissions for different roles and environments.
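To make that per-environment isolation concrete, here is a minimal sketch of provisioning an MWAA environment into its own VPC with boto3. Every name, subnet ID, ARN, and version number below is a hypothetical placeholder, not the customer's actual configuration:

```python
import boto3

# One MWAA environment per account/VPC (SANDBOX, TEST, PRODUCTION).
# All identifiers below are hypothetical placeholders.
ENVIRONMENTS = {
    "sandbox": {
        "subnet_ids": ["subnet-aaa111", "subnet-bbb222"],  # two private subnets, same VPC
        "security_group_ids": ["sg-0123456789abcdef0"],
    },
    # "test" and "production" follow the same shape in their own VPCs
}


def create_mwaa_environment(name: str, net: dict) -> None:
    """Provision an MWAA environment inside its own dedicated VPC.

    MWAA requires two private subnets in a single VPC; it cannot be
    deployed into a shared-subnet setup.
    """
    mwaa = boto3.client("mwaa")
    mwaa.create_environment(
        Name=f"etl-{name}",
        AirflowVersion="2.2.2",
        EnvironmentClass="mw1.small",
        MaxWorkers=5,
        # DAGs are synced from S3; bucket and role are placeholders.
        SourceBucketArn="arn:aws:s3:::example-mwaa-dags",
        DagS3Path="dags",
        ExecutionRoleArn="arn:aws:iam::123456789012:role/example-mwaa-execution",
        NetworkConfiguration={
            "SubnetIds": net["subnet_ids"],
            "SecurityGroupIds": net["security_group_ids"],
        },
        WebserverAccessMode="PRIVATE_ONLY",
    )


if __name__ == "__main__":
    for env_name, net_cfg in ENVIRONMENTS.items():
        create_mwaa_environment(env_name, net_cfg)
```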
Part of the migration involved transforming the existing DAGs programmatically. The `transform` method below walks the steps of an `EmrAddStepsOperator` and emits an equivalent `LivyBatchOperator` or `AzureHDInsightSshOperator` for each one. It is an excerpt from the transformer class (imports omitted), and the logic that extracts the SSH command or the Livy parameters from a step's params is assumed to live in a helper, sketched after the listing:

```python
# Excerpt: EmrAddStepsOperatorTransformer.transform (imports and class body omitted)
def transform(self, src_operator: BaseOperator, parent_fragment: DAGFragment,
              upstream_fragments: List[DAGFragment]) -> DAGFragment:
    """
    This transformer assumes and relies on the fact that an upstream
    transformation of an
    :class:`~airflow.contrib.operators.emr_create_job_flow_operator.EmrCreateJobFlowOperator`
    has already taken place, since it needs to find the output of that
    transformation to get the `cluster_name` and `azure_conn_id` from that
    operator (which should have been a
    :class:`ConnectedAzureHDInsightCreateClusterOperator`).

    It then goes through the EMR steps of this
    :class:`~airflow.contrib.operators.emr_add_steps_operator.EmrAddStepsOperator`
    and creates an :class:`AzureHDInsightSshOperator` or a
    :class:`LivyBatchOperator` for each corresponding step, based on grokking
    the step's params and figuring out whether it's a Spark job being run or
    an arbitrary Hadoop command like `distcp`, `hdfs` or the like.

    .. note:: This transformer creates multiple operators from a single
       source operator.

    .. note:: The Spark configuration for the Livy Spark job is derived from
       the `step` params of the EMR step, or could even be specified at the
       cluster level itself when transforming the job flow.
    """
    # Locate the cluster-creation operator emitted by the upstream
    # transformation of the EmrCreateJobFlowOperator.
    create_op_task_id = TransformerUtils.get_task_id_from_xcom_pull(
        src_operator.job_flow_id)
    create_op: BaseOperator = \
        TransformerUtils.find_op_in_fragment_list(
            upstream_fragments,
            operator_type=ConnectedAzureHDInsightCreateClusterOperator,
            task_id=create_op_task_id)
    if not create_op:
        raise UpstreamOperatorNotFoundException(
            ConnectedAzureHDInsightCreateClusterOperator, EmrAddStepsOperator)

    emr_add_steps_op: EmrAddStepsOperator = src_operator
    dag_fragment_steps = []
    # Join point for the per-step operators created below
    # (the exact task-id format is assumed).
    steps_added_op = DummyOperator(
        task_id=f"{emr_add_steps_op.task_id}_steps_added", dag=self.dag)

    for step in emr_add_steps_op.steps:
        # Classify the step: Spark job (Livy) vs. arbitrary Hadoop command
        # (SSH). The parsing helper is assumed; see the sketch below.
        name, ssh_command, livy_file, livy_arguments, livy_main_class, properties = \
            parse_emr_step(step)
        spark_conf = properties
        target_step_task_id = EmrAddStepsOperatorTransformer.get_target_step_task_id(
            emr_add_steps_op.task_id, emr_add_steps_op.steps.index(step))

        if ssh_command is not None:
            step_op = AzureHDInsightSshOperator(
                cluster_name=create_op.cluster_name,
                azure_conn_id=create_op.azure_conn_id,
                command=ssh_command,
                task_id=target_step_task_id,
                dag=self.dag)
        else:
            step_op = LivyBatchOperator(
                name=name,
                file=livy_file,
                arguments=livy_arguments,
                class_name=livy_main_class,
                azure_conn_id=create_op.azure_conn_id,
                cluster_name=create_op.cluster_name,
                spark_conf=spark_conf,
                task_id=target_step_task_id,
                dag=self.dag)

        TransformerUtils.copy_op_attrs(step_op, emr_add_steps_op)
        step_op.set_downstream(steps_added_op)
        dag_fragment_steps.append(step_op)

    return DAGFragment(dag_fragment_steps)  # return shape assumed
```
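Whether a step counts as a Spark job or a generic Hadoop command comes down to its `HadoopJarStep` arguments. Here is a minimal sketch of the kind of parsing helper assumed above; the helper name, the `command-runner.jar`-style argument layout, and the flag handling are all assumptions, not the transformer's actual code:

```python
from typing import List, Optional, Tuple


def parse_emr_step(step: dict) -> Tuple[
        str, Optional[str], Optional[str], List[str], Optional[str], dict]:
    """Classify one EMR step as a Spark job (Livy) or a Hadoop command (SSH).

    Returns (name, ssh_command, livy_file, livy_arguments, livy_main_class,
    properties). Hypothetical helper: the real parsing logic may differ.
    """
    name = step["Name"]
    args = step["HadoopJarStep"]["Args"]
    properties: dict = {}

    if args and args[0] == "spark-submit":
        # Walk spark-submit flags (only --class and --conf k=v style flags
        # are handled in this sketch), then take the app file and its args.
        livy_main_class = None
        i = 1
        while i < len(args) and args[i].startswith("--"):
            if args[i] == "--class":
                livy_main_class = args[i + 1]
            elif args[i] == "--conf":
                key, _, value = args[i + 1].partition("=")
                properties[key] = value
            i += 2
        livy_file = args[i]
        livy_arguments = args[i + 1:]
        return name, None, livy_file, livy_arguments, livy_main_class, properties

    # Anything else (distcp, hdfs, etc.) becomes a plain shell command over SSH.
    return name, " ".join(args), None, [], None, properties
```

In this scheme, the `--conf` pairs harvested from the step become the `spark_conf` handed to the `LivyBatchOperator`, which matches the docstring's note that the Spark configuration is derived from the EMR step's params.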