Takeaways from Building Intelligent Data Lakes Workshop

Robin Kim
6 min read · Jun 8, 2019

One of the greatest assets of living near a metropolitan city and a growing tech hub is the easy access to great events and meetups. Last Thursday, I had the opportunity to attend an event hosted by Slalom, AWS, and Women in Big Data titled “AWS: Building Intelligent Data Lakes” in Arlington, VA.

Women in Big Data (WiBD) is a volunteer-based forum that connects and celebrates female talent in the field of big data and analytics. It began with 15 members from Intel and has since grown to over 8,500 members. The American Association of University Women (AAUW) reported in 2015 that only 12% of working engineers were women and barely 18% of all computing degrees went to women. WiBD is working to close that gender gap and to grow women's representation in the field of big data to 25% by 2020.

Slalom is a modern consulting firm that blends design, engineering, and analytics to help businesses implement their strategies and brand goals. Its services include cloud integration, AI solutions, UI/UX design, and data analysis. Slalom's Practice Area Director, Paul Beda, shared customer success stories (which I write more about later in this post) that involved the use of big data and AWS tools.

A subsidiary of Amazon, Amazon Web Services (AWS) offers cloud computing web services and platforms. It provides over 165 services including compute, storage, databases, analytics, machine learning and artificial intelligence, application development, deployment, and management.

Rielah De Jesus, a Solutions Architect at AWS, explained why every company should regard data as a strategic asset: it is the raw material for robustly predictive models. And with good reason, since Forbes predicts that big data market revenue will reach $200 billion by 2020. Yet many companies are oblivious to the potential of their isolated data, and even among those that aren't, 85% of big data projects fail. Why? One of the biggest reasons is the hefty cost and challenge of moving data across silos.

Enter the data lake.

A data lake is a way to store and analyze massive volumes of heterogeneous data. All data is held in a central repository in its native format, which means it stays highly accessible and can be quickly ingested and transformed. In short, data lakes make data easy to use and put it to work.
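To make the "native format" idea concrete, here is a minimal Python (boto3) sketch of landing a raw record in S3, the storage service that typically acts as the lake's central repository. The bucket name, key layout, and record shape are hypothetical placeholders, not details from the workshop:

```python
import json
import boto3

s3 = boto3.client("s3")

# A raw sensor reading, stored exactly as it arrives; no schema is imposed up front.
record = {"device_id": "sensor-42", "temperature": 21.7, "ts": "2019-06-08T14:03:00Z"}

# Hypothetical bucket/key layout: raw data lands under its own prefix so that
# downstream tools (Glue, Athena, EMR) can discover and transform it later.
s3.put_object(
    Bucket="my-data-lake",  # placeholder bucket name
    Key="raw/iot/sensor-42/2019-06-08.json",
    Body=json.dumps(record).encode("utf-8"),
)
```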

AWS provides an array of services that together enable the full lifecycle of a data lake. These products address four key characteristics:

  • Collect anything: any type of data, from multiple sources (e.g. connected devices, web logs/cookies, social media, transactions)
  • Dive in anywhere: retrieve information from any point within the data lake
  • Flexible access: grants access to every authorized department and individual making use of the data, and allows integration with other AWS tools
  • Future proof: withstands time and new technology

De Jesus told us that AWS products can be combined like Lego pieces to build various data-analysis architectures.

Example of how multiple AWS products can be linked together to collect, store, and analyze data

After an introduction to AWS, we were given materials to a three-part workshop that walked us through building our own data lake and analysis architecture.

Lab 1: Hydrate the data lake using Kinesis Platform

Lab 1. Provided by Amazon Web Services

This lab consisted of using Kinesis Streams to collect and store streaming IoT sensor data, then using Kinesis Analytics to analyze the streaming data in real-time. We used a built-in, machine learning algorithm called RANDOM_CUT_FOREST, which serves to detect anomalies in streaming data. In the final portion of this lab, we used Kinesis Firehose to export data in both raw and processed forms into S3 — a storage service that allows for further analysis. This lab highlighted the flexibility of Amazon Kinesis in allowing multiple consumers to read and process data from the same stream. In our case, our first consumer was Kinesis Analytics, which analyzed the data in real-time and stored the result in S3, and our second consumer was Kinesis Firehose, which stored raw data directly in S3.
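To illustrate the producer side of this setup, here is a minimal boto3 sketch that pushes simulated sensor readings into a Kinesis stream; the stream name and payload shape are my own placeholders, not the ones from the lab materials:

```python
import json
import random
import boto3

kinesis = boto3.client("kinesis")

# Simulate an IoT sensor pushing readings into a Kinesis stream. Both
# consumers (Kinesis Analytics and Kinesis Firehose) can then read these
# same records from the stream independently.
for _ in range(10):
    reading = {"device_id": "sensor-1", "value": random.gauss(50, 5)}
    kinesis.put_record(
        StreamName="iot-sensor-stream",     # placeholder stream name
        Data=json.dumps(reading).encode("utf-8"),
        PartitionKey=reading["device_id"],  # determines which shard gets the record
    )
```

On the analytics side, RANDOM_CUT_FOREST is applied in Kinesis Analytics' SQL dialect and assigns each incoming record an anomaly score.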

Lab 2: Catalog, Transform and Visualize using Glue, Athena, EMR, Redshift Spectrum and QuickSight

Lab 2. Provided by Amazon Web Services

In Lab 2, we populated the AWS Glue Data Catalog with the data stored in the S3 bucket from Lab 1. The Data Catalog holds references to the location, schema, and runtime metrics of the data, and AWS Glue uses these references to target, extract, and transform it. We learned that columnar storage formats such as Parquet (as opposed to CSV) are optimal for analytics because queries read fewer bytes and therefore run faster. After transforming the CSV data retrieved from S3 into Parquet with AWS Glue, we used Athena to query the data and QuickSight to visualize it in various chart forms. We were also given the option of querying the data with EMR (Elastic MapReduce) or Redshift Spectrum.
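As a rough sketch of the query step, Athena can also be driven programmatically with boto3; the database, table, and output location below are hypothetical stand-ins for the ones created in the lab:

```python
import boto3

athena = boto3.client("athena")

# Run SQL against a table registered in the Glue Data Catalog. Athena scans
# the Parquet files directly in S3 and writes query results back to S3.
response = athena.start_query_execution(
    QueryString=(
        "SELECT device_id, avg(value) AS avg_value "
        "FROM sensor_data GROUP BY device_id"  # placeholder table name
    ),
    QueryExecutionContext={"Database": "datalake_db"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution() for completion
```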

Lab 3: SageMaker

Lab 3. Provided by Amazon Web Services

The final lab integrated Amazon SageMaker to train, create, and host a machine learning model that detects anomalies. We prepared the data with AWS Glue, then trained SageMaker's built-in K-Means algorithm on the transformed data. The trained model is hosted behind an endpoint that can be invoked for predictions, which lets applications program reactive measures against anomalies.
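Once the model is hosted, predictions come from invoking its endpoint. Here is a minimal boto3 sketch; the endpoint name and feature vector are assumptions for illustration (SageMaker's built-in K-Means accepts CSV input for inference):

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Send one feature vector to the hosted K-Means endpoint. The response reports
# the closest cluster; an application can treat points far from all "normal"
# clusters as anomalies and react accordingly.
response = runtime.invoke_endpoint(
    EndpointName="kmeans-anomaly-endpoint",  # placeholder endpoint name
    ContentType="text/csv",
    Body="21.7,50.3,0.98",                   # placeholder feature vector
)
print(response["Body"].read().decode("utf-8"))
```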

Slalom's Practice Area Director, Paul Beda, shared a customer success story about their client Avis Budget Group that illustrated a practical use of data and analysis tools like the ones provided by AWS. The sudden upsurge of ridesharing services like Uber and Lyft hit car rental companies hard, and Avis Budget Group needed a smart way to keep inventory of its vehicles and cut unnecessary costs. Slalom compiled extensive data on how the vehicles were being used (for instance, they found that some were being rented by Uber drivers who drove back and forth within a defined area) and came up with an efficient, feasible system of garage organization: the most-utilized vehicles were parked in the portions of the garages with the easiest access.

Beda also talked about innovative ways of using and selling data. For instance, cars that can detect potholes or certain weather conditions (e.g. when the windshield wipers are on) can relay that information to city governments or weather forecasting systems. Smart data usage can also usher in an age of personalization: we can imagine a near future with rental cars and hotel rooms that preset the temperature, radio, or TV channels based on previous preferences.

Panelists discussing topics of cloud computing and diversity

One of the panelists, Amy Tseng, a Database Administration Manager at Fannie Mae, said she led her team's transition to the cloud because of challenges they faced with scalability, concurrency, provisioning, and security. Another key issue was the time it took to transport big data. AWS provided solutions for all of these challenges, including cutting data transport time from two weeks to a couple of hours. With a growing list of products and services, AWS is sure to provide feasible solutions for many business needs.

I want to thank AWS, Slalom, and Women in Big Data for an amazing afternoon of learning and for their dedication to supporting and growing women's representation in the tech industry. It was an invaluable experience listening to these female leaders and learning how to leverage AWS services when working with big data.
