{"id":177334,"date":"2024-01-21T17:58:24","date_gmt":"2024-01-21T16:58:24","guid":{"rendered":"https:\/\/liora.io\/en\/?p=177334"},"modified":"2026-02-06T08:33:30","modified_gmt":"2026-02-06T07:33:30","slug":"aws-glue-what-is-it-whats-it-for","status":"publish","type":"post","link":"https:\/\/liora.io\/en\/aws-glue-what-is-it-whats-it-for","title":{"rendered":"AWS Glue: What is it? What&#8217;s it for?"},"content":{"rendered":"<h2>AWS Glue is a fully managed, scalable data processing service that enables users to run serverless ETL (Extract, Transform, Load) workflows, freeing them from the need to manage the underlying infrastructure.<\/h2>\t\t\n\t\t\t<h3>A reminder about ETL processes<\/h3>\t\t\n\t\t<p><strong>ETL<\/strong> is a process designed to guarantee data quality and availability. It is divided into three phases:<\/p><ul><li><strong>Extraction<\/strong>: retrieving data from various sources<\/li><li><strong>Transformation:<\/strong> <a href=\"https:\/\/liora.io\/en\/data-cleaning-definition-methods-and-relevance-in-data-science\">cleansing, normalizing and modifying data to make it usable<\/a><\/li><li><strong>Loading:<\/strong> loading transformed data into a final environment, such as a database or <a href=\"https:\/\/liora.io\/en\/data-warehouse-2\">data warehouse.<\/a><\/li><\/ul>\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t<figure>\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image3-5-e1681996722906.png\" title=\"\" alt=\"\" loading=\"lazy\">\t\t\t\t\t\t\t\t\t\t\t<figcaption>Source: Informatica.com<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t<h3>How is AWS Glue structured?<\/h3>\t\t\n\t\t<p><strong>AWS Glue<\/strong> jobs perform the necessary extraction, transformation and loading of data from a source to a destination. 
The following diagram shows the architecture of <strong>AWS Glue<\/strong>; its components are described below:<\/p>\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t<figure>\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image1-7.png\" title=\"\" alt=\"\" loading=\"lazy\">\t\t\t\t\t\t\t\t\t\t\t<figcaption><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t<ul><li style=\"font-weight: 400\"><strong>Data Catalog:<\/strong> the persistent metadata store in <strong>AWS Glue<\/strong>. It contains table definitions, job definitions and other control information used to manage your AWS Glue environment.<\/li><li style=\"font-weight: 400\"><strong>Database:<\/strong> a set of associated Data Catalog table definitions, organized into a logical group.<\/li><li style=\"font-weight: 400\"><strong>Crawler:<\/strong> a program that connects to a data store, works through a prioritized list of classifiers to determine the schema of the data, and then creates table definitions in the Data Catalog.<\/li><li style=\"font-weight: 400\"><strong>Connection:<\/strong> a Data Catalog object that contains the properties needed to connect to a particular data store.<\/li><li style=\"font-weight: 400\"><strong>Classifier:<\/strong> determines the schema of the data. <strong>AWS Glue<\/strong> provides classifiers for the most common file types, such as CSV and JSON.<\/li><li style=\"font-weight: 400\"><strong>Data store:<\/strong> a repository for persistent data storage.<\/li><li style=\"font-weight: 400\"><strong>Data source:<\/strong> a data store used as the input to a job or transformation.<\/li><li style=\"font-weight: 400\"><strong>Data target:<\/strong> a data store to which the transformed data is written.<\/li><li style=\"font-weight: 400\"><strong>Job:<\/strong> the business logic required to perform ETL work, made up of a transformation script, data sources and data targets.<\/li><\/ul>\t\t\n\t\t\t\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"><div 
class=\"wp-block-button \"><a class=\"wp-block-button__link wp-element-button \" href=\"\/en\/courses\/cloud-dev\/aws-solutions-architect\">Learn how to use AWS Glue<\/a><\/div><\/div>\n\n\t\t\t<h3>AWS Glue features <\/h3>\t\t\n\t\t<p><strong>AWS Glue<\/strong> allows you to fully manage your <strong>ETL processes<\/strong> through a variety of features, the most important of which are listed below:<\/p>\t\t\n\t\t\t\n<table style=\"undefined;width: 800px\">\n<colgroup>\n<col style=\"width: 100px\">\n<col style=\"width: 250px\">\n<col style=\"width: 450px\">\n<\/colgroup>\n<thead>\n  <tr>\n    <th><img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image7-2.png\" alt=\"Image\" width=\"100\" height=\"100\"><\/th>\n    <th>Data Collection and Integration<\/th>\n    <th>AWS Glue allows for the collection and integration of data from various sources, including databases, flat files, streaming data, etc.<\/th>\n  <\/tr>\n<\/thead>\n<tbody>\n  <tr>\n    <td><img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image9-1.png\" alt=\"Image\" width=\"100\" height=\"93\"><\/td>\n    <td>Data Transformation<\/td>\n    <td>Provides a set of tools for transforming data, including data processing functions, filtering, sorting, joining, and 
more.<\/td>\n  <\/tr>\n  <tr>\n    <td><img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image5-4.png\" alt=\"Image\" width=\"100\" height=\"100\"><\/td>\n    <td>Data Catalog<\/td>\n    <td>Allows for the creation and management of a metadata catalog that facilitates data discovery, search, and analysis.<\/td>\n  <\/tr>\n  <tr>\n    <td><img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image11-1.png\" alt=\"Image\" width=\"100\" height=\"100\"><\/td>\n    <td>ETL Task Execution and Scheduling<\/td>\n    <td>AWS Glue enables the scheduling and execution of ETL tasks to process data at scale.<\/td>\n  <\/tr>\n  <tr>\n    <td><img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image4-7.png\" alt=\"Image\" width=\"100\" height=\"100\"><\/td>\n    <td>Workflow Automation<\/td>\n    <td>Offers workflow automation features to orchestrate complex tasks involving multiple steps.<\/td>\n  <\/tr>\n  <tr>\n    <td><img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image8-2.png\" alt=\"Image\" width=\"100\" height=\"100\"><\/td>\n    <td>Custom Jobs<\/td>\n    <td>Enables the creation of custom jobs to address specific use cases. 
Custom jobs can be created using common programming languages such as Python and Scala.<\/td>\n  <\/tr>\n  <tr>\n    <td><img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image2-9.png\" alt=\"Image\" width=\"100\" height=\"99\"><\/td>\n    <td>Error Handling<\/td>\n    <td>Allows for the management of errors encountered during data processing, such as syntax errors or connectivity issues.<\/td>\n  <\/tr>\n  <tr>\n    <td><img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image6-5.png\" alt=\"Image\" width=\"100\" height=\"100\"><\/td>\n    <td>Monitoring<\/td>\n    <td>AWS Glue provides monitoring features to track ETL job performance, detect errors and performance issues, and optimize resource utilization.<\/td>\n  <\/tr>\n<\/tbody>\n<\/table>\n\t\t\t<h3>Advantages and disadvantages of AWS Glue<\/h3>\t\t\n\t\t<p>Before you start learning and using <strong>AWS Glue<\/strong>, it&#8217;s important to weigh its advantages and disadvantages:<\/p>\t\t\n\t\t\t\n<table style=\"undefined;width: 600px\">\n<colgroup>\n<col style=\"width: 300px\">\n<col style=\"width: 300px\">\n<\/colgroup>\n<thead>\n  <tr>\n    <th>Advantages<\/th>\n    
<th>Disadvantages<\/th>\n  <\/tr>\n<\/thead>\n<tbody>\n  <tr>\n    <td>Large-scale data management<\/td>\n    <td>Costs can be high for small businesses or small-scale projects, even though the service is fully managed<\/td>\n  <\/tr>\n  <tr>\n    <td>Fast data processing<\/td>\n    <td>Steep learning curve<\/td>\n  <\/tr>\n  <tr>\n    <td>Integration with other AWS services<\/td>\n    <td>Limited workflow customization<\/td>\n  <\/tr>\n  <tr>\n    <td>Support for multiple programming languages (Python and Scala)<\/td>\n    <td>Requires expertise in data engineering<\/td>\n  <\/tr>\n  <tr>\n    <td>Fully managed platform<\/td>\n    <td rowspan=\"2\"><\/td>\n  <\/tr>\n  <tr>\n    <td>Built-in metadata catalog<\/td>\n  <\/tr>\n<\/tbody>\n<\/table>\n\t\t\t<h3>Conclusion<\/h3>\t\t\n\t\t<p>As you&#8217;ve probably gathered by now, AWS Glue is a fully managed <a href=\"https:\/\/liora.io\/en\/etl-or-extract-transform-load-definition-and-use\">ETL workflow service from AWS.<\/a> Its power and flexibility nevertheless come with a steep learning curve and a substantial investment to configure it for your needs.<\/p><p>&nbsp;<\/p><p>
Related articles:<\/p><table dir=\"ltr\" border=\"1\" cellspacing=\"0\" cellpadding=\"0\" data-sheets-root=\"1\"><colgroup><col width=\"656\"><\/colgroup><tbody><tr><td data-sheets-value=\"{&quot;1&quot;:2,&quot;2&quot;:&quot;AWS Elastic Load Balancer: The solution that distributes network traffic&quot;}\" data-sheets-hyperlink=\"https:\/\/liora.io\/en\/aws-elastic-load-balancer-the-solution-that-distributes-network-traffic\"><a href=\"https:\/\/liora.io\/en\/aws-elastic-load-balancer-the-solution-that-distributes-network-traffic\" target=\"_blank\" rel=\"noopener\">AWS Elastic Load Balancer: The solution that distributes network traffic<\/a><\/td><\/tr><tr><td data-sheets-value=\"{&quot;1&quot;:2,&quot;2&quot;:&quot;Jam AWS: The playful Amazon learning platform&quot;}\" data-sheets-hyperlink=\"https:\/\/liora.io\/en\/jam-aws-the-playful-learning-platform-from-amazon\"><a href=\"https:\/\/liora.io\/en\/jam-aws-the-playful-learning-platform-from-amazon\" target=\"_blank\" rel=\"noopener\">Jam AWS: The playful Amazon learning platform<\/a><\/td><\/tr><tr><td data-sheets-value=\"{&quot;1&quot;:2,&quot;2&quot;:&quot;AWS Lambda: Introduction to the Serverless Function&quot;}\" data-sheets-hyperlink=\"https:\/\/liora.io\/en\/aws-lambda-introduction-to-the-serverless-function\"><a href=\"https:\/\/liora.io\/en\/aws-lambda-introduction-to-the-serverless-function\" target=\"_blank\" rel=\"noopener\">AWS Lambda: Introduction to the Serverless Function<\/a><\/td><\/tr><tr><td data-sheets-value=\"{&quot;1&quot;:2,&quot;2&quot;:&quot;AWS Certification: What is it and how do I get it? &quot;}\" data-sheets-hyperlink=\"https:\/\/liora.io\/en\/aws-certification-what-is-it-and-how-do-i-get-it\"><a href=\"https:\/\/liora.io\/en\/aws-certification-what-is-it-and-how-do-i-get-it\" target=\"_blank\" rel=\"noopener\">AWS Certification: What is it and how do I get it? 
<\/a><\/td><\/tr><tr><td data-sheets-value=\"{&quot;1&quot;:2,&quot;2&quot;:&quot;AWS SageMaker: A guide for using the platform&quot;}\" data-sheets-hyperlink=\"https:\/\/liora.io\/en\/aws-sagemaker-a-guide-for-using-the-platform\"><a href=\"https:\/\/liora.io\/en\/aws-sagemaker-a-guide-for-using-the-platform\" target=\"_blank\" rel=\"noopener\">AWS SageMaker: A guide for using the platform<\/a><\/td><\/tr><tr><td data-sheets-value=\"{&quot;1&quot;:2,&quot;2&quot;:&quot;5 AWS launches and announcements making developers\u2019 life easy in 2022&quot;}\" data-sheets-hyperlink=\"https:\/\/liora.io\/en\/5-aws-launches-and-announcements-making-developers-life-easy-in-2022\"><a href=\"https:\/\/liora.io\/en\/5-aws-launches-and-announcements-making-developers-life-easy-in-2022\" target=\"_blank\" rel=\"noopener\">5 AWS launches and announcements making developers\u2019 life easy in 2022<\/a><\/td><\/tr><\/tbody><\/table>\t\t\n\t\t\t\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"><div class=\"wp-block-button \"><a class=\"wp-block-button__link wp-element-button \" href=\"\/en\/courses\/cloud-dev\/aws-solutions-architect\">Discover the AWS Glue service<\/a><\/div><\/div>\n","protected":false},"excerpt":{"rendered":"<p>AWS Glue is a fully managed, scalable data processing service that enables users to run serverless ETL (Extract, Transform, Load) workflows, freeing them from the need to manage the underlying infrastructure. A reminder about ETL processes ETL is a process designed to guarantee data quality and availability. 
It is divided into 3 phases: Extraction: recovery [&hellip;]<\/p>\n","protected":false},"author":76,"featured_media":177351,"comment_status":"open","ping_status":"open","sticky":false,"template":"elementor_theme","format":"standard","meta":{"_acf_changed":false,"editor_notices":[],"footnotes":""},"categories":[2433],"class_list":["post-177334","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai"],"acf":[],"_links":{"self":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/177334","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/users\/76"}],"replies":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/comments?post=177334"}],"version-history":[{"count":1,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/177334\/revisions"}],"predecessor-version":[{"id":206086,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/177334\/revisions\/206086"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media\/177351"}],"wp:attachment":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media?parent=177334"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/categories?post=177334"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}