Thursday, July 7, 2016

Hadoop

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
There are four main modules in Hadoop.
  1. Hadoop Common: The common utilities that support the other Hadoop modules.
  2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  3. Hadoop YARN: A framework for job scheduling and cluster resource management.
  4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
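To make the MapReduce model concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API. The class name and the input/output paths passed on the command line are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}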
Before going further, let's note that there are three different types of data.
  • Structured: Structured data has a strong schema, and the schema is checked during write and read operations, e.g. data in RDBMS systems like Oracle, MySQL etc.
  • Unstructured: Unstructured data has no predefined structure and can be in any form: web server logs, e-mails, images etc.
  • Semi-structured: Semi-structured data is not strictly structured but has some structure, e.g. XML files.
Depending on the type of data to be processed, we have to choose the right technology.
Some more projects, which are part of Hadoop:
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad-hoc querying.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
Hive has the edge over Pig in partitions, server mode, web interface and JDBC/ODBC support.
Some differences:
  1. Hive is best for structured data; Pig is best for semi-structured data.
  2. Hive is used for reporting; Pig is used for programming.
  3. Hive is a declarative SQL-like language; Pig is a procedural language.
  4. Hive supports partitions; Pig does not.
  5. Hive can start an optional Thrift-based server (see the JDBC sketch after this list); Pig cannot.
  6. Hive defines tables beforehand (schema) and stores the schema information in a database; Pig has no dedicated metadata database.
  7. Both Hive and Pig support Avro; in Hive, specify the serde as org.apache.hadoop.hive.serde2.avro.
  8. Pig additionally supports the COGROUP feature for performing outer joins, which Hive does not. But both Hive and Pig can join, order and sort dynamically.
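Regarding point 5 above, Hive's optional Thrift-based server (HiveServer2) can be queried from plain Java over JDBC. A minimal sketch, assuming HiveServer2 is reachable on localhost:10000 and that a table named web_logs exists (both are assumptions for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver; HiveServer2 uses the hive2 URL scheme
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Host, port, database and credentials below are assumptions
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // web_logs is a hypothetical table used only for illustration
             ResultSet rs = stmt.executeQuery(
                 "SELECT status, COUNT(*) FROM web_logs GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}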
HBase won't replace MapReduce. HBase is a scalable distributed database; MapReduce is a programming model for distributed processing of data. MapReduce may act on data stored in HBase during processing.
You can use Hive/HBase for structured/semi-structured data and process it with Hadoop MapReduce.
You can use Sqoop to import structured data from a traditional RDBMS such as Oracle or SQL Server and process it with Hadoop MapReduce.
You can use Flume to ingest unstructured data and process it with Hadoop MapReduce.
Have a look at: Hadoop Use Cases.
Hive should be used for analytical querying of data collected over a period of time, e.g. calculating trends or summarizing website logs, but it can't be used for real-time queries.
HBase fits real-time querying of big data; Facebook uses it for messaging and real-time analytics (a minimal client sketch follows below).
Pig can be used to construct data flows, run scheduled jobs, crunch big volumes of data, aggregate/summarize it and store the results in relational database systems. It is good for ad-hoc analysis.
Hive can also be used for ad-hoc data analysis, but unlike Pig it can't handle all unstructured data formats.
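As a sketch of the real-time access pattern HBase is built for, the snippet below writes one row and reads it back by row key using the standard HBase Java client. The "messages" table, the "d" column family and the row key are hypothetical, and the cluster location is assumed to come from hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRealtimeLookup {
    public static void main(String[] args) throws Exception {
        // Cluster settings are read from hbase-site.xml (assumption)
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // "messages" table with column family "d" is hypothetical
             Table table = connection.getTable(TableName.valueOf("messages"))) {

            // Write one row keyed by user id
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("last_msg"), Bytes.toBytes("hello"));
            table.put(put);

            // Point lookup by row key: the real-time access pattern HBase serves well
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("d"), Bytes.toBytes("last_msg"))));
        }
    }
}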

Real-time processing is needed where fast decisions must be taken, e.g. a fire alarm sent by a sensor or fraud detection in banking transactions. Batch processing is needed to summarize data that can be fed into BI systems.
We used Hadoop ecosystem technologies for the applications above.
Real Time Processing
Apache Storm: stream data processing, rule application.
HBase: datastore for serving the real-time dashboard.
Batch Processing
Hadoop: crunching huge chunks of data, building a 360-degree overview or adding context to events. Interfaces or frameworks like Pig, MapReduce, Spark, Hive and Shark help with the computation. This layer needs a scheduler, for which Oozie is a good option.
Event Handling Layer
Apache Kafka was the first layer to consume the high-velocity events from the sensors. Kafka serves both the real-time and batch analytics data flows through LinkedIn connectors.
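As a sketch of this event-handling layer, the snippet below publishes a sensor event to Kafka with the standard Java producer API. The broker address and the "sensor-events" topic are assumptions for illustration.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address is an assumption; point this at your Kafka cluster
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // "sensor-events" is a hypothetical topic; the real-time (Storm) and batch
            // layers can each consume it independently
            producer.send(new ProducerRecord<>("sensor-events", "sensor-42", "temperature=78.5"));
        }
    }
}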

Wednesday, July 6, 2016

Apache Spark



RDD stands for Resilient Distributed Dataset, a fault-tolerant collection of elements that can be operated on in parallel.
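A minimal sketch of creating and transforming an RDD with Spark's Java API; the application name, local master and sample data are illustrative choices.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Build an RDD from a local collection; it is partitioned across the workers
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // Transformations are lazy; the reduce action triggers the parallel computation
            int sumOfSquares = numbers.map(x -> x * x).reduce((a, b) -> a + b);
            System.out.println("Sum of squares: " + sumOfSquares);
        }
    }
}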

Java Lambda Expression

A lambda expression is an unnamed function with parameters and a body.
The lambda expression body can be a block statement or an expression.
-> separates the parameters and the body.
(int x) -> x + 1 takes an int parameter and returns the parameter value incremented by 1.
(int x, int y) -> x + y takes two int parameters and returns the sum.
(String msg) -> { System.out.println(msg); } takes a String parameter and prints it on the standard output.
msg -> System.out.println(msg) takes a parameter and prints it on the standard output. It is equivalent to the previous example.
() -> "hi" takes no parameters and returns a string.
(String str) -> str.length() takes a String parameter and returns its length.
The following lambda takes two int parameters and returns the maximum of the two.
(int x, int y)  ->  {  
    int max = x  > y  ?  x  : y;
    return max;
}
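A lambda expression only becomes usable when assigned to a functional interface. The sketch below types the examples above against the standard java.util.function interfaces; the variable names are just illustrative.

import java.util.function.BiFunction;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

public class LambdaExamples {
    public static void main(String[] args) {
        UnaryOperator<Integer> increment = x -> x + 1;               // x -> x + 1
        BiFunction<Integer, Integer, Integer> sum = (x, y) -> x + y; // sum of two values
        Consumer<String> printer = msg -> System.out.println(msg);   // prints its argument
        Supplier<String> greeter = () -> "hi";                       // no parameters
        Function<String, Integer> length = str -> str.length();      // String -> its length
        BiFunction<Integer, Integer, Integer> max = (x, y) -> {      // block body with return
            return x > y ? x : y;
        };

        System.out.println(increment.apply(41));    // 42
        System.out.println(sum.apply(2, 3));        // 5
        printer.accept("hello");                    // hello
        System.out.println(greeter.get());          // hi
        System.out.println(length.apply("hadoop")); // 6
        System.out.println(max.apply(7, 11));       // 11
    }
}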

Sunday, July 3, 2016

Concurrent API

Synchronizers:
  • Semaphore controls access to shared resources. A semaphore maintains a counter to specify the number of resources that the semaphore controls.
  • CountDownLatch allows one or more threads to wait for a countdown to complete (see the sketch after this list).
  • The Exchanger class is meant for exchanging data between two threads. It is useful when two threads need to synchronize with each other and continuously exchange data.
  • CyclicBarrier provides a synchronization point where threads may need to wait at a predefined execution point until all other threads reach that point.
  • Phaser is useful when a few independent threads have to work in phases to complete a task.
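A minimal CountDownLatch sketch: the main thread blocks until all worker threads have counted down. The worker count and printed messages are illustrative.

import java.util.concurrent.CountDownLatch;

public class LatchExample {
    public static void main(String[] args) throws InterruptedException {
        int workers = 3;
        CountDownLatch latch = new CountDownLatch(workers);

        for (int i = 0; i < workers; i++) {
            final int id = i;
            new Thread(() -> {
                System.out.println("Worker " + id + " finished");
                latch.countDown();           // signal completion
            }).start();
        }

        latch.await();                       // block until the count reaches zero
        System.out.println("All workers done, main thread continues");
    }
}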
Executor Framework:
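A minimal sketch of the executor framework: a fixed thread pool runs a Callable and the result is retrieved through a Future. The pool size and the task are arbitrary choices for illustration.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExecutorExample {
    public static void main(String[] args) throws Exception {
        // Fixed pool of 4 threads; the size is an arbitrary choice for this sketch
        ExecutorService pool = Executors.newFixedThreadPool(4);

        Callable<Integer> task = () -> {
            int sum = 0;
            for (int i = 1; i <= 100; i++) {
                sum += i;
            }
            return sum;
        };

        Future<Integer> result = pool.submit(task);          // runs asynchronously on a pool thread
        System.out.println("Sum 1..100 = " + result.get());  // blocks until the task completes

        pool.shutdown();  // stop accepting new tasks and let running ones finish
    }
}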


Spring MVC Exception Codes

BindException: 400 - Bad Request
ConversionNotSupportedException: 500 - Internal Server Error
HttpMediaTypeNotAcceptableException: 406 - Not Acceptable
HttpMediaTypeNotSupportedException: 415 - Unsupported Media Type
HttpMessageNotReadableException: 400 - Bad Request
HttpMessageNotWritableException: 500 - Internal Server Error
HttpRequestMethodNotSupportedException: 405 - Method Not Allowed
MethodArgumentNotValidException: 400 - Bad Request
MissingServletRequestParameterException: 400 - Bad Request
MissingServletRequestPartException: 400 - Bad Request
NoSuchRequestHandlingMethodException: 404 - Not Found
TypeMismatchException: 400 - Bad Request
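These are the status codes Spring MVC assigns to the standard exceptions by default. To return a custom body for one of them while keeping the status, a @ControllerAdvice with @ExceptionHandler can be used; the class name and message below are illustrative, mapping MethodArgumentNotValidException to a 400 response.

import org.springframework.http.HttpStatus;
import org.springframework.web.bind.MethodArgumentNotValidException;
import org.springframework.web.bind.annotation.ControllerAdvice;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.ResponseBody;
import org.springframework.web.bind.annotation.ResponseStatus;

@ControllerAdvice
public class GlobalExceptionHandler {

    // Keep the 400 status but return a simple text body describing the validation failure
    @ExceptionHandler(MethodArgumentNotValidException.class)
    @ResponseStatus(HttpStatus.BAD_REQUEST)
    @ResponseBody
    public String handleValidationError(MethodArgumentNotValidException ex) {
        return "Validation failed: " + ex.getBindingResult().getFieldErrorCount() + " field error(s)";
    }
}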