Thursday, January 5, 2012

Big Data Technologies and Players

The Big Data technology landscape has four key layers/areas:

Below are some players in each area. This list is by no means complete; these are some well-known players in each area.

  • Infrastructure - The key to Big Data infrastructure is easy scalability to handle petabytes of data, so the cloud becomes a natural choice. That is why you will see many public cloud providers in this area in the graphic below. 
  • Data Storage - Traditional storage methods (i.e., the RDBMS) are not a good option because of their price and scalability restrictions. New storage methods, particularly NoSQL and DFS (distributed file systems), represent the paradigm shift in the storage arena. Among these, Hadoop HDFS is the most commonly used storage for Big Data, though other storage options are also used depending on the use case. 
  • Data Processing and Management - In this area Hadoop MapReduce stands out, as it is the framework used for processing massive amounts of data in parallel. Product vendors have also implemented the same principles in their products. 
  • Data Analytics - This is the area where many established vendors are still in play, providing visualization and predictive analytics. The Hadoop project is also implementing libraries such as Mahout, but these are not yet as mature as some of the vendors' products. Another area that has emerged because of Big Data is "dataset providers". These players provide public datasets that you can simply download and use for your analytics. 
Following are some public data providers. 

For a deep dive into each of these technology areas, I am writing Hadoop blog posts. Please refer to those for details.
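To make the Data Processing and Management layer above concrete, here is a minimal sketch of the MapReduce idea in plain Python. This is a local simulation for illustration only, not the actual Hadoop API: the map step emits (word, 1) pairs, a shuffle step groups pairs by key (the framework does this between map and reduce), and the reduce step sums each group.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

The point of the model is that map and reduce are independent per key, so Hadoop can run thousands of these tasks in parallel across a cluster, with the shuffle moving intermediate pairs between machines.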

Tuesday, August 23, 2011

What is Big Data


Definitions:
“Big Data” is when the size of the data itself becomes part of the “problem”.

“Big Data” is a term applied to data sets that are large, complex and dynamic, and beyond the ability of commonly used software tools to capture, manage, and process them within a tolerable elapsed time.

“Big Data” is also used as an umbrella term for the ecosystem of processing and storing large, complex and dynamic data using Map/Reduce processing and NoSQL storage techniques.
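As a toy illustration of the NoSQL side of that ecosystem (the names here are hypothetical, not any real product's API), a document store simply keeps schema-less records, so two records need not share the same fields the way rows of an RDBMS table must:

```python
# A toy schema-less "document store": a dict of document id -> document.
# Hypothetical sketch for illustration only; real NoSQL stores add
# replication, partitioning, and durability on top of this basic idea.
store = {}

def put(doc_id, doc):
    store[doc_id] = doc

def get(doc_id):
    return store.get(doc_id)

# Unlike RDBMS rows, documents in the same store can have different fields.
put("u1", {"name": "Alice", "age": 30})
put("u2", {"name": "Bob", "tags": ["hadoop", "nosql"], "city": "NYC"})

print(get("u2")["tags"])  # ['hadoop', 'nosql']
```

This flexibility is what lets NoSQL systems absorb the "variety" dimension of Big Data without up-front schema design.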


Dimensions of Big Data:
 Volume (amount of data), Velocity (rate of data in/out) and Variety (range of data types).

Some facts:

  • A recent IDC study projected that the total volume of electronically stored data and files - the digital universe - would reach 1.2 zettabytes in 2010. That's 21 zeros behind the 1, if you're keeping count. 
  • Traditional DW/BI tools are typically capable of handling about 5 TB of data at a time.

Some of you may ask, "This data has been there for a long time, what's the big deal now?"

The answer to that is:
The need for massive processing resources and storage posed the biggest challenge in putting this data to use. With access to cheap storage and large, scalable computing infrastructure, the significance of this dormant data has increased many-fold. And there are now technologies available to extract insight from unstructured data.
Uses of Big Data:
1. Information transparency and usability at much higher frequency
2. More accurate and detailed performance information
3. Ever-narrower segmentation of customers

So let's re-define Big-Data:
"Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and/or analysis." 


Having laid down the basics of understanding Big Data, I will dig a little deeper into the Big Data ecosystem in my next post.