Intro to big data, HDInsight (Hadoop) on Azure

TLDR:

HDInsight is Microsoft’s flavor of Hadoop. It lets you collect massive amounts of data at once and decide later how that data gets stored, sorted, and visualized.

I’m giving a talk to MBAs at Villanova University later this week, and I’m covering HDInsight. In preparing my notes, I wrote an outline that I thought would come in handy for folks starting off. Feel free to grab any of it.

Additionally, there is an excellent overview video on Channel 9, where you can also find the slide deck. I’ve embedded the video at the bottom of these notes, too.

Hadoop for Windows

Hadoop as a service in Azure
Powered by Hortonworks Data Platform (HDP)
- Set of components that come together to create Hadoop
Brings the same power the web companies (Google, Yahoo) were using to Windows Server, and not just Linux

Hortonworks Data Platform

Comprehensive set of components
- Distributed Storage and Processing
- Redundant
- Secure
- Processing SQL-like queries

Microsoft Big Data Ecosystem

One end-to-end big data platform

Machine Learning
- Cloud based predictive analytics
Event Hub
- Collect Data
- Log millions of events in near real-time
Notification Hubs
- Send notifications to any platform on any backend
  - Node.js, .NET
  - Mobile devices, web
- Designed to scale to millions of customers at once
Stream Analytics
- Filter/analyze/aggregate data
- Event processing engine over streaming data in the cloud
- Real time analytics
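
To make "aggregate over streaming data" a bit more concrete, here is a minimal sketch of a tumbling-window count in plain Python. It is not Stream Analytics code; the event shape `(timestamp, device_id)` and the function name are assumptions made up for illustration.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, device) using fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, device_id in events:
        # Integer division snaps each timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, device_id)] += 1
    return dict(counts)

# Hypothetical (timestamp in seconds, device id) events.
events = [(0, "a"), (3, "a"), (7, "b"), (12, "a"), (14, "b"), (19, "b")]
window_counts = tumbling_window_counts(events, window_seconds=10)
```

A real streaming engine does this continuously as events arrive; the loop above only shows the grouping idea on a small, finite list.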

Architecture differences (relational vs Hadoop)

Relational (SQL)
- Prescribing a schema immediately (tables, columns)
- Writes are slower, reads are quick
- Optimized for normalizing data and processing a pattern
Hadoop
- Doing things at scale
  - Lots of info from many devices at once
- When loading data in:
  - Raw data
  - Prescribe a schema later, when you know exactly how you want to use it
  - Allows for different kinds of processing at a later time
  - Writes are quick, reads are slow
  - Optimized for storing in raw form
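
The "prescribe a schema later" idea can be sketched in a few lines of Python. This is only an illustration under assumed data: the sample records, the field names, and the `read_with_schema` helper are all hypothetical and not part of Hadoop or HDInsight.

```python
import json

# Raw events are stored exactly as they arrive; no schema is enforced on write.
raw_lines = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2015-03-01T10:00:00"}',
    '{"device": "sensor-2", "temp_c": "19.0"}',
    'not json at all',
]

def read_with_schema(lines, schema):
    """Apply a schema (field name -> converter) while reading, skipping rows that do not fit."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {field: convert(record[field]) for field, convert in schema.items()}
        except (ValueError, KeyError, TypeError):
            # Malformed or incomplete rows are only discovered at read time.
            continue

# The schema is chosen when the data is used, not when it is written.
temperature_schema = {"device": str, "temp_c": float}
rows = list(read_with_schema(raw_lines, temperature_schema))
```

The same raw lines can be read again later with a different schema (for example one that requires `ts`), which is the flexibility the bullets above describe. The cost is that every read has to parse and validate, which is one reason reads are slower.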

Use Cases

SQL vs Hadoop

Relational (SQL)
- Operational data store
- Interactive OLAP Analytics
  - (Online Analytical Processing) – Extract and view data from different points-of-view
  - Three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing
- Complex ACID transactions
Hadoop
- Looking through log files or text data
- Data discovery
- Massive storage / processing in a cost effective way to scan later
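
The three OLAP operations listed under Relational (roll-up, drill-down, slicing) are easier to see on a tiny data set. Here is a rough sketch in plain Python; the sales figures and the `aggregate` helper are made up for illustration, and a real OLAP engine would do this over a cube or a SQL database.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, region, amount).
sales = [
    (2014, "Q1", "East", 100),
    (2014, "Q1", "West", 150),
    (2014, "Q2", "East", 120),
    (2015, "Q1", "East", 130),
    (2015, "Q1", "West", 170),
]

def aggregate(facts, key):
    """Sum the amount column, grouped by whatever key(fact) returns."""
    totals = defaultdict(int)
    for fact in facts:
        totals[key(fact)] += fact[3]
    return dict(totals)

# Roll-up: consolidate to one total per year.
by_year = aggregate(sales, key=lambda f: f[0])

# Drill-down: break each year into its quarters.
by_year_quarter = aggregate(sales, key=lambda f: (f[0], f[1]))

# Slice: fix one dimension (region == "East"), then aggregate the rest.
east_by_year = aggregate([f for f in sales if f[2] == "East"], key=lambda f: f[0])
```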

Hadoop Deployment Choices

HDInsight Service
- Benefits of Azure: Elasticity, low cost, no infrastructure needed
HDInsight Server
- Hadoop w/ MSFT tools on your on-premises server (download & install)

How Hadoop is architected

Distributed Storage and Processing for large scale applications
Distributed File System at its lowest level (HDFS)
- Even if multiple machines go down, you won’t lose data
- “Striping” (Remember RAID on your hard drives?)
  - Files are split into blocks, and each block is replicated (three copies by default) across multiple racks
- Self-healing (Should I call you Logan, or Weapon X?)
MapReduce
- Runs large data processing jobs in parallel across many nodes
- Later, combines those results
- Commander:
  - “You go out and gather intel! Report to me when you’re done!”
  - Commander then handles the paperwork, writes the report
Blob storage
- Most cost-effective way to store data as binary (0s & 1s)
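
To go with the commander analogy, here is a minimal word-count sketch of the map, shuffle, and reduce steps in plain Python. It runs on one machine and does not use Hadoop at all; the function names are my own, and on a real cluster the framework handles the shuffle and spreads the map and reduce calls across nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # On a cluster each line (or block) would be mapped on a different node;
    # here the "nodes" are just iterations of a loop.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
```

The map calls are the soldiers gathering intel independently, and the reduce calls are the commander writing the report once everything is back.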

.NET MapReduce for Hadoop

Write MapReduce functions using .NET in Visual Studio
Develop a program and submit it to the Hadoop cluster
.NET Hadoop SDK -> Job Tracker
- Runs your code across the cluster

Enterprise Data Services

WebHDFS: Web services interface for HDFS

One entry point for loading data into the system
- Use a RESTful API to stream your data into HDFS (e.g., log files)
- Perform file & directory functions
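
WebHDFS URLs follow the pattern `/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds those URLs and makes no network calls; the host name, port, and paths are placeholders, and the `webhdfs_url` helper is hypothetical.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL of the form /webhdfs/v1/<path>?op=<OP>&..."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

# Placeholder NameNode address; substitute your own cluster's host and port.
HOST, PORT = "namenode.example.com", 50070

# Directory function: create a directory (sent with HTTP PUT).
mkdir_url = webhdfs_url(HOST, PORT, "/logs/2015", "MKDIRS")

# Load data: create a file (sent with HTTP PUT).
create_url = webhdfs_url(HOST, PORT, "/logs/2015/web.log", "CREATE", overwrite="true")

# Directory function: list a directory (sent with HTTP GET).
list_url = webhdfs_url(HOST, PORT, "/logs/2015", "LISTSTATUS")
```

Note that `CREATE` is a two-step operation: the first PUT goes to the NameNode, which answers with a redirect to a DataNode, and the file contents are sent in a second PUT to that location.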

Sqoop

SQL to Hadoop
- Connect a relational DB and point to table
- Can go from SQL to Hadoop, or the other way around (Bi-directional)
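
Sqoop is driven from the command line, so the two directions can be sketched as argument lists. The JDBC URL, table name, and HDFS directories below are placeholders, and the two helper functions are hypothetical; nothing here actually runs Sqoop.

```python
def sqoop_import_args(jdbc_url, table, target_dir):
    """Arguments for copying one relational table into HDFS (SQL -> Hadoop)."""
    return ["sqoop", "import",
            "--connect", jdbc_url,
            "--table", table,
            "--target-dir", target_dir]

def sqoop_export_args(jdbc_url, table, export_dir):
    """Arguments for copying HDFS files back into a relational table (Hadoop -> SQL)."""
    return ["sqoop", "export",
            "--connect", jdbc_url,
            "--table", table,
            "--export-dir", export_dir]

# Placeholder connection string and paths.
JDBC_URL = "jdbc:sqlserver://db.example.com:1433;databaseName=sales"

import_cmd = sqoop_import_args(JDBC_URL, "orders", "/data/orders")
export_cmd = sqoop_export_args(JDBC_URL, "order_summary", "/data/order_summary")
```

On a machine with Sqoop installed, either list could be handed to `subprocess.run`; credentials and other options are left out of this sketch.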

Hive

SQL interface for Hadoop
- SQL-like interface that enables data summaries, ad-hoc query, & analysis of large datasets

HCatalog

Access to Hadoop data as a set of tables without concern for where/how data is stored
Deep interop & data access between external services

Pig

Scripting platform for processing and analyzing large data sets
- Extract-transform-load (ETL) data pipelines
- Research on raw data
- Iterative data processing

Oozie

Workflow & scheduling system
Coordinates jobs written in multiple languages (MapReduce, Pig, Hive)
Allows specification of order & dependencies between jobs, so processing happens in an ordered manner
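
The "order & dependencies" part boils down to a dependency graph. Here is a small sketch using Python's standard `graphlib` module (Python 3.9+) to work out a valid run order. The job names are made up, and real Oozie workflows are defined in XML rather than Python; this only illustrates the ordering idea.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs, each mapped to the set of jobs it depends on.
dependencies = {
    "pig_clean_logs": set(),
    "hive_summarize": {"pig_clean_logs"},
    "mapreduce_sessions": {"pig_clean_logs"},
    "hive_report": {"hive_summarize", "mapreduce_sessions"},
}

# static_order() yields every job after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
```

The two middle jobs have no dependency on each other, so a scheduler is free to run them in either order or in parallel.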

Data Refinery

Capture Data -> Process -> Exchange

Data Sources
- Web logs, sensor data, social media
Process (systems)
- Parse, cleanse, apply structure, & transform (Hadoop)
Applications
- Business Analytics
- Enterprise apps
- Output to Excel
- Visualizations
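
Here is a rough, single-machine sketch of the Capture -> Process -> Exchange flow: raw web log lines go in, malformed lines are dropped, structure is applied, and the result comes out as CSV that Excel can open. The log format and function names are assumptions for illustration; on a cluster the process step would be a Hadoop job.

```python
import csv
import io

# Capture: hypothetical raw web log lines, one of them malformed.
RAW_LOG = """\
2015-03-01 10:00:01 GET /index.html 200
2015-03-01 10:00:02 GET /missing 404
garbage line
2015-03-01 10:00:05 POST /login 200
"""

FIELDS = ["date", "time", "method", "path", "status"]

def parse(lines):
    """Process: split raw lines into fields, dropping lines that do not fit the format."""
    for line in lines:
        parts = line.split()
        if len(parts) == 5 and parts[4].isdigit():
            record = dict(zip(FIELDS, parts))
            record["status"] = int(record["status"])
            yield record

def to_csv(records):
    """Exchange: write structured records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

records = list(parse(RAW_LOG.splitlines()))
report = to_csv(records)
```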

-----------------------
@DaveVoyles

Dave Voyles | Software Engineer, Microsoft

Machine Learning and game development
