How to access the Amazon S3 files from Athena data source
Accessing Amazon S3 Files in Amazon Athena
- Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL.
- Athena is serverless, so there is no infrastructure to set up or manage and you can start analyzing your data immediately.
- You don’t even need to load your data into Athena or have complex ETL processes. Athena works directly with data stored in S3.
Access Permission for Amazon S3:
- If the users need to create tables and work with underlying data, they must have access to the Amazon S3 location of the data.
- This access is in addition to the allowed actions for Athena that you define in IAM identity-based polices.
- You can grant access to Amazon S3 locations using identity-based policies, bucket resource policies, or both.
Creating Table in Amazon Athena:
- Athena reads all data stored in s3://bucketname/folder/'.
- Athena does not modify your data in Amazon S3.
- When you create a database and table in Athena, you are simply describing the schema and the location where the table data are located in Amazon S3 for read-time querying.
- Database and table, therefore, have a slightly different meaning than they do for traditional relational database systems because the data isn’t stored along with the schema definition for the database and table.
- Amazon Athena provides the easiest way to run ad-hoc queries for data in S3 without the need to setup or manage any servers.
Create table query:
Syntax:
CREATE EXTERNAL TABLE test_table(
...
)
ROW FORMAT ...
STORED AS INPUTFORMAT ...
OUTPUTFORMAT ...
LOCATION s3://bucketname/folder/
Example for table creation for a csv file:
CREATE EXTERNAL TABLE flight_delays_csv (
year INT,
quarter INT,
month INT,
dayofmonth INT,
dayofweek INT,
flightdate STRING,
uniquecarrier STRING,
airlineid INT,
carrier STRING,
tailnum STRING,
flightnum STRING )
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-myregion/flight/csv/';
All tables are considered external:
If you use CREATE TABLE without the EXTERNAL keyword, Athena issues an error; only tables with the EXTERNAL keyword can be created. We recommend that you always use the EXTERNAL keyword. When you drop a table in Athena, only the table metadata is removed; the data remains in Amazon S3.
We also have the option to create the table using Athena’s table wizard (accessible from the Catalog Manager).
Then we can specify the format:
And the name and data type of each column:
We can also set up the partition model:
We can query the table in Amazon Athena using SELECT Statements.
Considerations:
1.) Files treated as hidden – Athena treats sources files that start with an underscore (_) or a dot (.) as hidden. To work around this limitation, rename the files.
2.) When creating schemas for data on S3, the positional order is important. For example, if you have a source file with ID, DATE and RESPONSE columns, then your schema should reflect that structure.
Additional Reference:
https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/
https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html
https://aws.amazon.com/blogs/aws/amazon-athena-interactive-sql-queries-for-data-in-amazon-s3/