In this article, I will show you how to read data stored in an Amazon S3 bucket by connecting it to AWS Athena and executing some SQL queries against it. If you prefer to learn through video, check out my YouTube video on this guide here.

Wondering what AWS Athena is and how it works? Let me give you a quick introduction. AWS Athena is a serverless query service which can be used to read and analyze large amounts of data directly from S3 using standard SQL. It uses an approach known as schema-on-read, which allows you to project your schema onto your data at the time you execute a query, and this eliminates the need for any data loading or ETL. Why do these two things matter? Because they save money.

Under the hood, Athena uses Apache Hive to create, drop, and alter tables and partitions, so you can write Hive-compliant DDL statements as well as ANSI SQL statements in the Athena query editor. You can also use complex joins, window functions, and complex data types. Athena follows a pay-as-you-go pricing model where you pay for the queries you run: the cost is determined by the amount of data scanned by each query, and the amount of data scanned is displayed in the console after each execution.

How to read data from Amazon S3 using AWS Athena

Before we get to the how-to steps, let's first look at the sample data in the S3 bucket we want to read. I have two buckets created in S3: one contains the raw source data to read, and the other stores the results of the SQL queries executed in AWS Athena. The bucket "aws-simplified-athena-demo" contains the source data I want to query, and the data in it is raw transaction data in JSON format. The sketch below shows roughly what the table definition and a first query could look like.
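As a minimal sketch, assuming hypothetical names (a demo_db database, a transactions table, and JSON fields such as transaction_id, customer_id, and amount; only the bucket name comes from this guide), the Athena side could look like this:

```sql
-- Create a database to hold the table definition (metadata only).
CREATE DATABASE IF NOT EXISTS demo_db;

-- External table: the data stays in S3 and the schema is projected
-- onto it at query time (schema-on-read, no loading or ETL).
-- The field names below are assumptions for illustration.
CREATE EXTERNAL TABLE IF NOT EXISTS demo_db.transactions (
  transaction_id string,
  customer_id    string,
  amount         double,
  created_at     string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://aws-simplified-athena-demo/';

-- A first query. After it runs, the console shows how much data
-- was scanned, which is what determines the cost.
SELECT customer_id, SUM(amount) AS total_spent
FROM demo_db.transactions
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10;
```

The second bucket from the setup comes into play before you run anything: Athena asks you to configure a query result location in S3, and that is where it writes the output of each query you execute.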
In this article, we will start querying CSVs and JSONs in Google Cloud Storage (GCS) and create new tables from existing tables (an ETL process). The article is mainly for BI developers who want to expand their capabilities to handle Big Data and who successfully finished part one. I will present two examples, one over CSV files and another over JSON files; you can find them here. In this case, I needed to create two tables that hold YouTube data from Google Storage. After that, we will create tables for those files and join both tables. The whole process is as follows:

Query the CSV Files

Because the data is structured, this use case is simpler. The table is for the Ingestion Level (MRR) and should be named YouTubeVideosShorten. Create a new folder in your bucket named YouTubeVideos and put the files there. In this case, we use the table as "External". We calculate how long each video stayed trending after its upload, and we also select the first 2 letters of each file name to get our country. After creating your table, make sure you see it in the table list (you can see my configuration in the following picture). A sketch of this step appears after this section.

Query the JSON files

Because the data is semi-structured, this use case is a little more difficult. The table is for the Ingestion Level (MRR) and should be named YouTubeStatistics. Create a new folder in your bucket named YouTubeStatistics and put the files there. In this case, too, we use the table as "External". After creating your table, make sure you see it in the table list.

The next table we create will be named YouTubeCategories: we will extract the categories from the JSON file. Notice that the UNNEST() function explodes the JSON (which arrives as a single row) and allocates the kind and etag values to all the other rows. After the query runs, click "Save Results", click "BigQuery" and then "Save". Give this table the name "YouTubeCategories" and save it. Make sure you can now view this table in your table list.

In the last step we join the tables YouTubeCategories and YouTubeVideosShorten. We need to join these tables by Category_Id and by Country. Now you can insert your result into another table (YouTubeFact). Notice that the last 2 tables are native tables, so their data is in BigQuery and not in Google Storage.

The process I developed here is not best practice (YouTubeVideosShorten should be transformed before I join it with YouTubeCategories; can you tell why?). Try to develop an STG table for YouTubeVideosShorten before joining it to the categories, and note that the Fact table schema shouldn't be changed. Sketches of these steps follow below.
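A minimal sketch of the CSV step, assuming a hypothetical bucket (my-bucket), dataset (yt_demo), column list, and trending_date format borrowed from the public YouTube trending dataset; only the folder and table names come from the article:

```sql
-- Ingestion-level (MRR) external table over the CSV files in
-- gs://my-bucket/YouTubeVideos/. Bucket, dataset, columns, and the
-- trending_date format are assumptions for illustration.
CREATE EXTERNAL TABLE yt_demo.YouTubeVideosShorten (
  video_id      STRING,
  trending_date STRING,    -- e.g. '17.14.11' (yy.dd.mm)
  title         STRING,
  category_id   INT64,
  publish_time  TIMESTAMP
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/YouTubeVideos/*.csv'],
  skip_leading_rows = 1
);

-- Country comes from the first 2 letters of each file name (the
-- _FILE_NAME pseudo-column is available on external tables), and we
-- compute how many days after upload the video was still trending.
SELECT
  SUBSTR(REGEXP_EXTRACT(_FILE_NAME, r'[^/]+$'), 1, 2) AS Country,
  video_id,
  title,
  category_id,
  DATE_DIFF(PARSE_DATE('%y.%d.%m', trending_date),
            DATE(publish_time), DAY) AS days_trending
FROM yt_demo.YouTubeVideosShorten;
```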
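For the JSON step, a sketch under the same assumed names, additionally assuming the category files are newline-delimited JSON with an items array, as in the public *_category_id.json files:

```sql
-- Ingestion-level (MRR) external table over the JSON files.
-- The nested schema is an assumption based on the public dataset.
CREATE EXTERNAL TABLE yt_demo.YouTubeStatistics (
  kind  STRING,
  etag  STRING,
  items ARRAY<STRUCT<id STRING,
                     snippet STRUCT<title STRING, assignable BOOL>>>
)
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://my-bucket/YouTubeStatistics/*.json']
);

-- UNNEST() explodes the items array (the raw JSON is a single row)
-- and repeats the top-level kind and etag on every output row.
-- Save the result via "Save Results" -> "BigQuery" as YouTubeCategories.
SELECT
  s.kind,
  s.etag,
  SUBSTR(REGEXP_EXTRACT(_FILE_NAME, r'[^/]+$'), 1, 2) AS Country,
  CAST(item.id AS INT64) AS Category_Id,
  item.snippet.title     AS Category_Name
FROM yt_demo.YouTubeStatistics AS s,
     UNNEST(s.items) AS item;
```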
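And a sketch of the final join, assuming a pre-created YouTubeFact native table with matching columns. Repeating the derivation inline like this is exactly the "not best practice" part the exercise asks you to fix with an STG table:

```sql
-- Join the shortened videos to the categories by Category_Id and
-- Country, and insert the result into the native fact table.
-- YouTubeFact is assumed to already exist with this exact column
-- list (its schema shouldn't be changed).
INSERT INTO yt_demo.YouTubeFact
  (Country, Video_Id, Title, Category_Name, Days_Trending)
SELECT
  v.Country,
  v.video_id,
  v.title,
  c.Category_Name,
  v.days_trending
FROM (
  -- Inline re-derivation of Country and days_trending; an STG table
  -- would compute this once instead of repeating it here.
  SELECT
    SUBSTR(REGEXP_EXTRACT(_FILE_NAME, r'[^/]+$'), 1, 2) AS Country,
    video_id, title, category_id,
    DATE_DIFF(PARSE_DATE('%y.%d.%m', trending_date),
              DATE(publish_time), DAY) AS days_trending
  FROM yt_demo.YouTubeVideosShorten
) AS v
JOIN yt_demo.YouTubeCategories AS c
  ON c.Category_Id = v.category_id
 AND c.Country     = v.Country;
```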