- Docker
- Bash
Alluxio is a data orchestration layer that provides a single access point for data. Object stores, HDFS and NFS can be mounted into a single namespace, and user access can be managed in Alluxio using ACLs.
- Navigate into the
directory and run the following command
docker build -f Dockerfile -t data_platform/alluxio:latest .
Ensure the cloud storage credentials are available in the alluxio
- Run the commands below in separate shell clients
- The Alluxio master acts as the main coordinator
bash alluxio-master.sh
- The Alluxio worker executes tasks from the master
- More than one worker can be spawned
bash alluxio-worker.sh
docker exec -it alluxio-master alluxio fs mount --option fs.gcs.credential.path=credentials.json /lta-datamall gs://lta-datamall/
docker exec -it alluxio-master alluxio fs ls /
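When several buckets need to be mounted, it can help to generate the mount command rather than retype it. The snippet below is a minimal sketch, not part of the original setup: `alluxio_mount_cmd` is a hypothetical helper that only builds the same `docker exec ... alluxio fs mount` argument list shown above (without `-it`, since no interactive terminal is needed when scripting).

```python
import shlex


def alluxio_mount_cmd(alluxio_path, ufs_uri, options=None, container="alluxio-master"):
    """Build the argv for `alluxio fs mount` run inside the master container.

    `alluxio_mount_cmd` is a hypothetical helper; the container name and the
    mount syntax are taken from the commands in this README.
    """
    cmd = ["docker", "exec", container, "alluxio", "fs", "mount"]
    for key, value in (options or {}).items():
        # Each mount option is passed as a separate `--option key=value` pair.
        cmd += ["--option", f"{key}={value}"]
    cmd += [alluxio_path, ufs_uri]
    return cmd


cmd = alluxio_mount_cmd(
    "/lta-datamall",
    "gs://lta-datamall/",
    options={"fs.gcs.credential.path": "credentials.json"},
)
# Print the command for review; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```

The command is returned as a list so that it can be handed to `subprocess.run` without shell quoting issues.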
The Hive metastore is used with Trino to serve catalog information such as table schemas.
- Below line has to be added into
- Jar file can be found in
directory - This allows hive to recognise the
export HIVE_AUX_JARS_PATH=${ALLUXIO_HOME}/client/alluxio-2.7.1-client.jar:${HIVE_AUX_JARS_PATH}
- Next edit the
- Ensure that the below property is set to the alluxio hostname and port
- To start the Hive metastore, two commands have to be run
- The command below initialises a new metastore
${HIVE_HOME}/bin/schematool -dbType derby -initSchema
- The command below serves the metastore on port 9083
${HIVE_HOME}/hcatalog/sbin/hcat_server.sh start
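Trino can only use the Hive catalog once the metastore is accepting connections, so a start-up script may want to wait for port 9083 first. The following is a minimal sketch of such a check using only the Python standard library; `wait_for_port` is a hypothetical helper and the host name in the usage comment is an assumption.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example (assumes the metastore is reachable on localhost):
# if not wait_for_port("localhost", 9083):
#     raise SystemExit("Hive metastore did not come up on port 9083")
```

This only confirms that the port is open, not that the metastore schema was initialised correctly.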
- Navigate into the
directory and run the following command
docker build -f Dockerfile -t data_platform/hive:latest .
- Run the command below in a shell client
bash start-hive.sh
Trino is a distributed SQL query engine that can federate access across a variety of data sources. Some of these sources are:
- MySQL
- PostgreSQL
- AWS S3, GCS, Azure Blob
- Alluxio
- This property file contains config specific to each node.
- Property is located in
- Reference : https://trino.io/docs/current/installation/deployment.html#node-properties
- Below is a minimal property file
- This property file contains a list of command-line options for launching the JVM
- Property is located in
- Reference : https://trino.io/docs/current/installation/deployment.html#jvm-config
- Below is a good starting JVM config
- This property file contains the config for the Trino Server
- Property is located in
- Reference : https://trino.io/docs/current/installation/deployment.html#config-properties
- Below is the config for the standalone server used in this guide
- This property file contains the config for the Hive connector
- Property is located in
- Reference : https://trino.io/docs/current/connector/hive.html#configuration
- Below is the config for Trino to mount the Hive catalog
Spark is a multi-purpose cluster computing engine.
- Jar dependencies of other applications can be added to Spark to allow interaction between them
- Jars can be distributed either through
spark-submit --jars <comma separated list of jar paths>
- or by adding the jars directly to the
${SPARK_HOME}/jars directory
- In this example the two required jars are copied using the latter method
- Alluxio client jar: allows Spark to interact with the Alluxio FS
- Trino JDBC connector (${SPARK_HOME}/jars/trino-jdbc-367.jar): allows Spark to make a JDBC connection to Trino
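With the Trino JDBC jar on the Spark classpath, a Trino table can be read through Spark's generic JDBC data source. The sketch below only assembles the reader options; `trino_jdbc_options` is a hypothetical helper, and the host `trino`, port `8080` and user `spark` are assumptions that should be adjusted to the actual deployment.

```python
def trino_jdbc_options(table, host="trino", port=8080,
                       catalog="hive", schema="lta_datamall", user="spark"):
    """Build the options for Spark's JDBC reader against Trino.

    Host, port and user are assumed defaults, not values from this README.
    """
    return {
        # Trino JDBC URLs have the form jdbc:trino://host:port/catalog/schema
        "url": f"jdbc:trino://{host}:{port}/{catalog}/{schema}",
        # Driver class shipped in the trino-jdbc jar
        "driver": "io.trino.jdbc.TrinoDriver",
        "dbtable": table,
        # The Trino driver requires a user name on every connection
        "user": user,
    }


opts = trino_jdbc_options("raw_buses_age_distribution")

# Inside a PySpark session (not executed here):
# df = spark.read.format("jdbc").options(**opts).load()
```

Keeping the options in a plain dictionary makes it easy to reuse the same connection settings for several tables.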
- Navigate into the
directory and run the following command
docker build -f Dockerfile -t data_platform/spark:latest .
- Run the command below in a shell client
bash start-spark.sh
- A catalog is equivalent to a connector
- The catalog name is derived from the
docker exec -it trino trino --catalog hive --debug
- Create a schema to isolate the tables within the bucket
CREATE SCHEMA hive.lta_datamall
WITH (location = 'alluxio://alluxio-master:19998/lta-datamall/');
- Create a table on top of the file location
- Location can be pointed either
- directly to the file or
- the directory where the file is located (note that if more than one file is in the directory, all the files are treated as a single table)
CREATE TABLE hive.lta_datamall.raw_buses_age_distribution (
year varchar,
age varchar,
number varchar
) WITH (
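external_location = 'alluxio://alluxio-master:19998/lta-datamall/raw/buses_age_distribution',
-- Assumed completion: the raw file is a CSV with a header row
format = 'CSV',
skip_header_line_count = 1
);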
- Since the table points to the file location, the data should mirror the structure of the file
SELECT * FROM hive.lta_datamall.raw_buses_age_distribution;
- Create a table over the 'refined' file in the bucket
- Due to a bug in Trino, the directory must exist first
docker exec -it alluxio-master alluxio fs mkdir /lta-datamall/refined/buses_age_distribution
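CREATE TABLE hive.lta_datamall.buses_age_distribution (
-- Column list follows the casts in the INSERT statement; the partition column comes last
age VARCHAR,
number INTEGER,
year INTEGER
) WITH (
external_location = 'alluxio://alluxio-master:19998/lta-datamall/refined/buses_age_distribution',
partitioned_by = ARRAY['year']
);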
- Insert the data
INSERT INTO hive.lta_datamall.buses_age_distribution
SELECT
cast(age as varchar),
cast(number as integer),
cast(year as integer)
FROM hive.lta_datamall.raw_buses_age_distribution;
- Read the file from the Alluxio path into a DataFrame
df = spark.read.csv("alluxio://alluxio-master:19998/lta-datamall/raw/buses_age_distribution/", header=True, inferSchema=True)
- Rewrite the file into the refined table path
- Note that the partition column has to be last in the column order
df.select("age_years", "number", "year")\
.write.parquet("alluxio://alluxio-master:19998/lta-datamall/refined/buses_age_distribution/", mode="overwrite", partitionBy="year")
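To make the effect of `partitionBy="year"` more concrete: Spark writes one sub-directory per partition value, named `year=<value>`, and the partition column is stored in the path rather than inside the Parquet files. The sketch below mimics that layout in plain Python so it can be run without Spark; `partition_layout` is a hypothetical helper and the sample rows are made-up values, not data from the bucket.

```python
from collections import defaultdict


def partition_layout(rows, partition_col, base):
    """Group rows into Hive-style partition directories (<col>=<value>)."""
    layout = defaultdict(list)
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not modified
        # The partition value moves into the directory name, not the file contents.
        value = row.pop(partition_col)
        layout[f"{base.rstrip('/')}/{partition_col}={value}"].append(row)
    return dict(layout)


# Made-up sample rows, for illustration only
sample_rows = [
    {"age_years": "0-1", "number": 10, "year": 2020},
    {"age_years": "1-2", "number": 12, "year": 2020},
    {"age_years": "0-1", "number": 8, "year": 2021},
]

layout = partition_layout(
    sample_rows,
    "year",
    "alluxio://alluxio-master:19998/lta-datamall/refined/buses_age_distribution/",
)
for directory, rows in layout.items():
    print(directory, rows)
```

A table defined on top of the base directory then reads `year` back from the directory names, which is why the partition column is declared last.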
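- Create a table on top of the directory
CREATE TABLE hive.lta_datamall.buses_age_distribution (
-- Assumed column list, matching the columns written by Spark; the partition column comes last
age_years VARCHAR,
number INTEGER,
year INTEGER
) WITH (
external_location = 'alluxio://alluxio-master:19998/lta-datamall/refined/buses_age_distribution',
-- Assumed: the files were written by Spark as Parquet
format = 'PARQUET',
partitioned_by = ARRAY['year']
);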
SELECT * FROM hive.lta_datamall.buses_age_distribution
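The same query can also be issued from a script through Trino's client REST API instead of the CLI. The following is a rough sketch using only the standard library: `run_trino_query` is a hypothetical helper, and the base URL `http://localhost:8080` and user name `readme` are assumptions. It posts the statement to `/v1/statement` and follows `nextUri` until the result is complete.

```python
import json
import urllib.request


def run_trino_query(sql, base_url="http://localhost:8080", user="readme",
                    opener=urllib.request.urlopen):
    """Run a statement through Trino's REST API and return all result rows.

    base_url and user are assumed values; `opener` can be swapped for testing.
    """
    headers = {"X-Trino-User": user}
    request = urllib.request.Request(
        f"{base_url}/v1/statement",
        data=sql.encode("utf-8"),
        headers=headers,
        method="POST",
    )
    rows = []
    while request is not None:
        with opener(request) as response:
            page = json.load(response)
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        # Result rows arrive in pages; some pages carry no data at all.
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        # Keep polling until the server stops returning a nextUri.
        request = urllib.request.Request(next_uri, headers=headers) if next_uri else None
    return rows


# Example (requires a running Trino server, not executed here):
# rows = run_trino_query("SELECT * FROM hive.lta_datamall.buses_age_distribution")
```

For anything beyond a quick check, the official Trino client libraries handle authentication, retries and type conversion more thoroughly than this sketch.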