I have extensive experience working on the Hadoop ecosystem: Hive, Spark, Sqoop, HBase, Redshift, Oozie, Storm, Impala, Kylin, etc. Also MongoDB, Cassandra, Spark SQL, Spark MLlib, and Spark Streaming.
Q1. Please provide one (small/medium) use case of your ETL work in detail.
1. Web Scrapers -> Kafka -> Elasticsearch -> Kibana [~10^7 log events, ~1 PB of data per day]
2. Twitter -> Python Producers -> PySpark on EMR -> Elasticsearch + Redshift -> Tableau [1 GB per day]
3. Radio API -> Kinesis -> Hive on MapReduce -> MySQL [2 GB per day]
4. Web API + Mobile API -> Kafka -> Python Consumers -> Teradata -> Power BI [4 GB per day] (a minimal sketch of the Kafka leg follows below)
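As a rough illustration of the Kafka leg in use case #4, here is a minimal producer/consumer pair using kafka-python. The broker address, topic name, and payload fields are placeholders rather than the production values.

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: push API events onto a Kafka topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("api-events", {"user_id": 42, "action": "click"})
producer.flush()

# Consumer side: read events back, ready to be staged for Teradata.
consumer = KafkaConsumer(
    "api-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # a dict like {"user_id": 42, "action": "click"}
    break  # demo only: stop after the first event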
Q2. Suppose there are some X customers who have made purchases. Can you write a SQL query to find the top 5 customers who made the most purchases?
SELECT TOP 5 custid, COUNT(DISTINCT orderid) AS Purchases
FROM orders
GROUP BY custid
ORDER BY Purchases DESC;
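Note that TOP is SQL Server syntax; on MySQL or PostgreSQL you would drop TOP 5 and append LIMIT 5 instead. For completeness, the same computation sketched in pandas, with toy data whose custid/orderid columns mirror the schema assumed above:

import pandas as pd

# Toy orders table; the duplicate orderid 13 shows why DISTINCT matters.
orders = pd.DataFrame({
    "custid":  [1, 1, 2, 2, 2, 3, 4, 5, 6, 6],
    "orderid": [10, 11, 12, 13, 13, 14, 15, 16, 17, 18],
})

# Distinct orders per customer, then keep the five largest counts.
top5 = (
    orders.groupby("custid")["orderid"]
          .nunique()
          .nlargest(5)
          .rename("Purchases")
)
print(top5)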
Q3. What did you use MapReduce for? What did you use Pig for? What did you use Hive for? Where does the data get transformed? / While performing transformations, where is the data?
MapReduce is a framework; I used it in use case #3 above.
I have not worked with Pig, but I understand how it differs from Hive.
I have used Hive in use case #3 above.
In MapReduce, the data is read (mapped) from storage (usually HDFS or an S3 bucket) and reduced (transformed, aggregated, etc.), with intermediate results written to disk between the map and reduce phases. If we run the same job in Spark, those intermediate steps happen in memory.
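As a minimal PySpark sketch of that in-memory flow, roughly mirroring use case #3 (the S3 paths and column names are hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("in-memory-agg").getOrCreate()

# "Map" step: read raw play records from storage (path is a placeholder).
plays = spark.read.json("s3a://example-bucket/radio-plays/")

# "Reduce" step: aggregate in memory instead of spilling between phases.
daily_counts = (
    plays.groupBy("station_id", "play_date")
         .agg(F.count("*").alias("plays"))
)

daily_counts.write.mode("overwrite").parquet("s3a://example-bucket/daily-plays/")
spark.stop()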