Intro to big data, HDInsight (Hadoop) on Azure


TLDR:

HDInsight is Microsoft’s flavor of Hadoop. It’s used to collect massive amounts of data at once, leaving the decisions about how that data gets stored, sorted, and visualized for later.

I’m giving a talk to MBAs at Villanova University later this week, and I’m covering HDInsight. In preparing my notes, I wrote an outline that I thought would come in handy for folks starting off. Feel free to grab any of it.

Additionally, there is an excellent overview video on Channel 9, where you can also find the slide deck. I’ve gone ahead and embedded the video at the bottom of these notes, too.

Hadoop for Windows

  • Hadoop as a service in Azure
  • Powered by Hortonworks Data Platform (HDP)
    • Set of components that come together to create Hadoop
  • Gives the Windows community the same power the web companies (Google, Yahoo) had, on Windows Server and not just Linux

Hortonworks Data Platform

  • Comprehensive set of components
    • Distributed Storage and Processing
    • Redundant
    • Secure
    • Processing SQL-like queries

Microsoft Big Data Ecosystem

One end-to-end big data platform

  • Machine Learning
    • Cloud based predictive analytics
  • Event Hub
    • Collect Data
    • Log millions of events in near real-time
  • Notification Hubs
    • Send notifications to any platform on any backend
      • Node.js, .NET
      • Mobile devices, web
    • Designed to scale, millions of customers at once
  • Stream Analytics
    • Filter/analyze/aggregate data
    • Event processing engine over streaming data in the cloud
    • Real time analytics

Architecture differences (relational vs Hadoop)

  • Relational (SQL)
    • Prescribing a schema immediately (tables, columns)
    • Writes are slower, reads are quick
    • Optimized for normalized data and predictable query patterns
  • Hadoop
    • Doing things at scale
      • Lots of info from many devices at once
    • When loading data in:
      • Raw data
      • Prescribe a schema later, when you know exactly how you want to use it
      • Allows for different kinds of processing at a later time
      • Writes are quick, reads are slow
      • Optimized for storing in raw form
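To make the schema difference concrete, here is a minimal sketch in Python. It is not HDInsight code: SQLite stands in for the relational side, a plain list of JSON lines stands in for raw storage, and the device readings and the `read_with_schema` helper are made up for illustration.

```python
import json
import sqlite3

# Hypothetical device readings; the second one carries an extra field.
raw_events = [
    '{"device": "a1", "temp": 21.5}',
    '{"device": "b2", "temp": 19.0, "humidity": 40}',
]

# Relational style (schema on write): the table shape is fixed before loading,
# so every row is parsed and fitted to the columns at write time.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (device TEXT NOT NULL, temp REAL NOT NULL)")
for line in raw_events:
    event = json.loads(line)
    db.execute("INSERT INTO readings VALUES (?, ?)", (event["device"], event["temp"]))
avg_temp = db.execute("SELECT AVG(temp) FROM readings").fetchone()[0]

# Hadoop style (schema on read): keep the raw lines untouched and decide
# which fields matter only when a question comes up.
raw_store = list(raw_events)  # the write is just a copy of the raw text


def read_with_schema(lines, fields):
    """Parse raw lines at read time, projecting only the requested fields."""
    for line in lines:
        event = json.loads(line)
        yield {name: event.get(name) for name in fields}


humidity_view = list(read_with_schema(raw_store, ["device", "humidity"]))
print(avg_temp)
print(humidity_view)
```

Notice that the relational table silently dropped the humidity field because nobody planned for it, while the raw store still has it available for a question asked later.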

Use Cases

SQL vs Hadoop

  • Relational (SQL)
    • Operational data store
    • Interactive OLAP Analytics
      • (Online Analytical Processing) – Extract and view data from different points-of-view
      •  Three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing
    • Complex ACID transactions
  • Hadoop
    • Looking through log files or text data
    • Data discovery
    • Massive, cost-effective storage and processing of data you want to scan later
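The three OLAP operations listed under the relational use cases are easier to picture with a tiny example. This is only a sketch in plain Python: the sales rows and the `aggregate` helper are hypothetical, and a real OLAP engine would do this over a cube rather than a list.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, region, amount)
sales = [
    (2014, "Q1", "East", 100),
    (2014, "Q1", "West", 150),
    (2014, "Q2", "East", 120),
    (2015, "Q1", "East", 130),
]


def aggregate(rows, key):
    """Sum the amount column, grouped by whatever key() extracts from a row."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# Consolidation (roll-up): summarize up to the year level.
by_year = aggregate(sales, key=lambda r: r[0])

# Drill-down: break one year back into its quarters.
quarters_2014 = aggregate([r for r in sales if r[0] == 2014], key=lambda r: r[1])

# Slicing and dicing: fix one dimension (region), then view by another (year).
east_by_year = aggregate([r for r in sales if r[2] == "East"], key=lambda r: r[0])

print(by_year, quarters_2014, east_by_year)
```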

Hadoop Deployment Choices

  • HDInsight Service
    • Benefits of Azure: Elasticity, low cost, no infrastructure needed
  • HDInsight Server
    • Hadoop w/ MSFT tools on your on-premises server (download & install)

How Hadoop is architected

  • Distributed Storage and Processing for large scale applications
  • Distributed File System at its lowest level (HDFS)
    • Even if multiple machines go down, you won’t lose data
    • “Striping” (Remember RAID on your hard drives?)
      • Replicated across multiple racks
    • Self-healing (Should I call you Logan, or Weapon X?)
  • MapReduce
    • Runs large data processing jobs in parallel across many nodes
    • Later, combines those results
    • Commander:
      • “You, go out and gather intel! Report to me when you’re done!”
      • Commander then handles the paperwork, writes the report
  • Blob storage
    • Most cost-effective way to store data as binary (0s & 1s)
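The commander analogy for MapReduce can be sketched as a word count in Python. This is a toy, not a Hadoop job: the log chunks are made up, and a thread pool stands in for the cluster nodes that would each process their own slice of the data.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input, already split into chunks as if spread across nodes.
chunks = [
    "error warn error",
    "info error",
    "warn info info",
]


def map_chunk(text):
    """Map step: each 'soldier' emits (word, 1) pairs for its own chunk."""
    return [(word, 1) for word in text.split()]


def reduce_pairs(mapped_chunks):
    """Reduce step: the 'commander' combines every report into one tally."""
    totals = defaultdict(int)
    for pairs in mapped_chunks:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)


# Threads stand in for cluster nodes; a real job runs on separate machines.
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(map_chunk, chunks))

word_counts = reduce_pairs(mapped)
print(word_counts)
```

The map step never needs to see another chunk, which is what lets the work spread across as many nodes as you have data for.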

.NET MapReduce for Hadoop

  • Write MapReduce functions using .NET in Visual Studio
  • Develop a program and submit it to the Hadoop cluster
  • .NET Hadoop SDK -> Job Tracker
    • Runs your code across the cluster

Enterprise Data Services

WebHDFS: Web services interface for HDFS

  • One entry point for loading data into the system
    • Use a RESTful API to stream your data into HDFS (e.g., log files)
    • Perform file & directory functions
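Here is a rough sketch of what those RESTful calls look like. It only builds the URLs and pairs them with HTTP verbs; it does not contact a cluster. The host name, port, and file paths are assumptions for illustration, and `webhdfs_url` is a hypothetical helper, so check how your own cluster exposes WebHDFS before relying on any of it.

```python
from urllib.parse import quote, urlencode

# Assumed values for illustration; substitute your own cluster's host and port.
NAMENODE = "http://namenode.example.com:50070"


def webhdfs_url(path, op, **params):
    """Build a WebHDFS REST URL of the form /webhdfs/v1/<path>?op=<OP>&..."""
    query = urlencode({"op": op, **params})
    return f"{NAMENODE}/webhdfs/v1{quote(path)}?{query}"


# Each file or directory function pairs a URL with an HTTP verb.
requests_to_send = [
    ("PUT", webhdfs_url("/logs/2015", "MKDIRS")),  # make a directory
    ("PUT", webhdfs_url("/logs/2015/app.log", "CREATE", overwrite="true")),  # upload
    ("GET", webhdfs_url("/logs/2015", "LISTSTATUS")),  # list a directory
    ("GET", webhdfs_url("/logs/2015/app.log", "OPEN")),  # read a file
]
for verb, url in requests_to_send:
    print(verb, url)
```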

Sqoop

  • SQL to Hadoop
    • Connect a relational DB and point to a table
    • Can go from SQL to Hadoop, or the other way around (Bi-directional)
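To show the two directions side by side, this sketch assembles the Sqoop command lines in Python without running them. The JDBC URL, user name, table names, and HDFS directories are all hypothetical, and `sqoop_command` is a helper I made up; only the `import`/`export` subcommands and the `--connect`, `--username`, `--table`, `--target-dir`, and `--export-dir` options come from Sqoop itself.

```python
import shlex

# Hypothetical connection details, for illustration only.
JDBC_URL = "jdbc:sqlserver://db.example.com:1433;databaseName=sales"


def sqoop_command(direction, table, hdfs_dir, user="etl_user"):
    """Return the argument list for an import (SQL -> Hadoop) or export (Hadoop -> SQL)."""
    if direction not in ("import", "export"):
        raise ValueError("direction must be 'import' or 'export'")
    # Imports write into a target directory; exports read from an export directory.
    dir_flag = "--target-dir" if direction == "import" else "--export-dir"
    return [
        "sqoop", direction,
        "--connect", JDBC_URL,
        "--username", user,
        "--table", table,
        dir_flag, hdfs_dir,
    ]


print(shlex.join(sqoop_command("import", "orders", "/data/orders")))
print(shlex.join(sqoop_command("export", "order_totals", "/data/order_totals")))
```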

Hive

  • SQL interface for Hadoop
    • SQL-like interface that enables data summaries, ad-hoc query, & analysis of large datasets

HCatalog

  • Access to Hadoop data as a set of tables without concern for where/how data is stored
  • Deep interop & data access between external services

Pig

  • Extract-transform-load (ETL) data pipelines,
  • Research on raw data, and
  • Iterative data processing.

Oozie

  • Workflow & scheduling system
  • Coordinates jobs written in multiple languages (MapReduce, Pig, Hive)
  • Allows specification of order & dependencies between jobs, so processing happens in an ordered manner
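The ordering idea can be sketched with Python's standard `graphlib` module. This is not how Oozie workflows are written (Oozie has its own workflow definitions); it only illustrates how declared dependencies turn into a run order. The job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key lists the jobs that must finish before it starts.
dependencies = {
    "load_logs": set(),
    "clean_with_pig": {"load_logs"},
    "summarize_with_hive": {"clean_with_pig"},
    "export_report": {"summarize_with_hive"},
}

# A topological sort yields an order in which every dependency runs first.
run_order = list(TopologicalSorter(dependencies).static_order())
for step, job in enumerate(run_order, start=1):
    print(step, job)
```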

Data Refinery

Capture Data -> Process -> Exchange

  • Data Sources
    • Web logs, sensor data, social media
  • Process (systems)
    • Parse, cleanse, apply structure, & transform (Hadoop)
  • Applications
    • Business Analytics
    • Enterprise apps
    • Output to Excel
    • Visualizations
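Here is a small sketch of the Capture -> Process -> Exchange flow in plain Python. The log format, the sample lines, and the `process` and `exchange` helpers are all assumptions for illustration; on a real cluster the processing step would run in Hadoop rather than in a single script.

```python
import csv
import io

# Capture: hypothetical raw web log lines, "timestamp method path status".
raw_logs = [
    "2015-03-01T10:00:00 GET /home 200",
    "garbage line",
    "2015-03-01T10:00:05 GET /pricing 404",
    "2015-03-01T10:00:09 POST /signup 200",
]

FIELDS = ["timestamp", "method", "path", "status"]


def process(lines):
    """Parse, cleanse, and apply structure to raw log lines."""
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or not parts[3].isdigit():
            continue  # cleanse: drop lines that do not fit the expected shape
        timestamp, method, path, status = parts
        yield {"timestamp": timestamp, "method": method, "path": path, "status": int(status)}


def exchange(records):
    """Write structured records as CSV text that Excel or a BI tool can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = list(process(raw_logs))
csv_text = exchange(records)
print(csv_text)
```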

 

 

 
