What is Big Data?

Big Data is a popular topic these days. Most of us have heard terms like ‘Big Data analysis’, ‘data mining’ and ‘BI’, which come up regularly in IT meet-ups and seminars. But what does ‘Big Data’ actually mean? Which analysis tools are there to extract the knowledge hidden within it, and which difficulties will we face as IT professionals?

‘Big Data’ is used to describe a massive volume of both structured and unstructured data that is too large to process using traditional database and software techniques. In fact, in most enterprise scenarios the volume of data is too big, it moves too fast, or it exceeds current processing capacity [1]. The three basic characteristics of Big Data are:

  • high-volume (MB, GB, TB, PB);
  • high-velocity (Batch, Periodic, Real Time); and
  • high-variety (Database, Photo, Video, Web, etc.) [2,3].

However, big data is of little use if we cannot analyse it.


Big Data Analysis Tools

Big data analysis is the process of examining large data sets comprising different data types to extract hidden patterns, unknown correlations, market trends and other useful business information [4]. Since big data takes too much time and money to load into a traditional relational database for analysis, new approaches for storing and analysing data have emerged that rely largely on machine learning and artificial intelligence (AI) programs [2]. Some of these methods are: ‘Advanced analytics’, ‘Predictive analytics’, ‘Data mining’, ‘Text analytics’ and ‘Statistical analysis’ [4]. Here we briefly describe some analytical tools and technologies for big data:

  • Hadoop: Hadoop lets you store files bigger than what fits on a single node or server, and it lets you store very many files. It is a free, open-source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment with a high degree of fault tolerance. It is based on four core components [5]:
    • Hadoop Common: A module containing the utilities that support the other Hadoop components.
    • Hadoop Distributed File System (HDFS): A distributed file system that stores data reliably across all the nodes in a Hadoop cluster.
    • MapReduce: The heart of Hadoop. MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable, fault-tolerant manner.
    • Yet Another Resource Negotiator (YARN): The next-generation MapReduce, which assigns CPU, memory and storage to applications running on a Hadoop cluster [5].
    • The Hadoop framework is used by major companies such as Yahoo and IBM for applications involving search engines and advertising, and its design was inspired by papers Google published on MapReduce and the Google File System.
  • Spark: A fast, general engine for large-scale data processing. It is an open-source cluster computing framework with in-memory analytics performance that is up to 100 times faster than Hadoop MapReduce, depending on the application [5, 6].
  • Hive: A Hadoop runtime component designed for users who feel more comfortable with SQL; they can write Hive Query Language (HQL) statements, which are similar to SQL statements [5].
  • Pig: A programming language designed to handle any type of data, helping users to focus more on analysing large data sets and less on writing map programs and reduce programs [5].
  • Sqoop: An ELT (Extract, Load, Transform) tool that supports bulk data transfer between Hadoop and structured data stores such as relational databases [5].
  • Flume: Flume allows you to flow data from a source into your Hadoop environment. It’s a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data [5].
  • HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so the user does not need to know where or how the data is stored [5].
  • Avro: An Apache open source project that provides data serialization and data exchange services for Hadoop [5].
  • HBase (NoSQL database): HBase is a column-oriented database management system that runs on top of HDFS (Hadoop Distributed File System). It is appropriate for sparse data sets, which are common in many big data use cases. HBase is not a relational data store, so it does not support a structured query language like SQL. HBase applications are written in Java, much like a typical MapReduce application, and HBase also supports writing applications in Avro, REST and Thrift [5].
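To make the map and reduce phases concrete, here is a minimal, single-machine sketch in Python of the programming model that Hadoop MapReduce distributes across a cluster. The function names (`map_fn`, `shuffle`, `reduce_fn`) are our own illustrative choices, not Hadoop APIs, and the shuffle step here is a stand-in for the framework's distributed grouping.

```python
from collections import defaultdict

def map_fn(line):
    """Map phase: emit a (word, 1) pair for each word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle phase: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce phase: combine the grouped values for one key (here, sum counts)."""
    return key, sum(values)

def word_count(lines):
    """Run the classic word-count job through map -> shuffle -> reduce."""
    mapped = (pair for line in lines for pair in map_fn(line))
    return dict(reduce_fn(k, vs) for k, vs in shuffle(mapped).items())

print(word_count(["big data is big", "data moves fast"]))
# {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

In a real Hadoop job the same map and reduce logic runs in parallel on many machines, with HDFS holding the input and the framework performing the shuffle over the network.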


Big Data Analysis Challenges

We have already mentioned some powerful technologies for analysing big data; nonetheless, big data analysis can be challenging. Figure 1 shows the results of a 2012 survey in the communications industry that identified the top four Big Data challenges as:

  1. Data integration: One of the main requirements of analysing big data is the ability to combine data that differs in source or structure at a reasonable time and cost. With such variety, managing and maintaining data quality becomes one of the big challenges in analysing big data: we need to meaningfully connect well-understood data from our data warehouse with data that is less well understood [7].
  2. Data volume: The ability to process the volume at an acceptable speed so that the information is available to decision makers when they need it [7].
  3. Skills availability: There is a lack of internal analytics skills, and hiring experienced analytics professionals is expensive [7].
  4. Solution cost: Since Big Data has opened up a world of possible business improvements, a great deal of experimentation and discovery takes place to identify the patterns that matter and their value. It is therefore crucial to reduce the cost of the solutions used to find that value [7].



Figure 1: Biggest Challenges for success in Big Data analysis. Source: TM Forum, 2012 [7]
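To illustrate the data-integration challenge above, here is a toy Python sketch that joins well-understood warehouse rows with less-structured JSON events on a shared key. All field names and data are invented for the example; real integration pipelines would of course run at far larger scale with tools like those described earlier.

```python
import json

# Well-understood, structured records, as they might sit in a data warehouse.
warehouse_rows = [
    {"customer_id": 1, "name": "Acme Pty Ltd", "segment": "enterprise"},
    {"customer_id": 2, "name": "Beta Co", "segment": "smb"},
]

# Semi-structured log events, e.g. as collected by a tool such as Flume.
# Note the fields vary from event to event.
raw_events = [
    '{"customer_id": 1, "action": "search", "query": "pricing"}',
    '{"customer_id": 2, "action": "click"}',
    '{"customer_id": 1, "action": "purchase", "amount": 950.0}',
]

def integrate(rows, events):
    """Attach each parsed event to the warehouse record it belongs to."""
    by_id = {row["customer_id"]: dict(row, events=[]) for row in rows}
    for raw in events:
        event = json.loads(raw)               # structure varies per event
        customer = by_id.get(event.pop("customer_id"))
        if customer is not None:              # drop events with no matching row
            customer["events"].append(event)
    return list(by_id.values())

merged = integrate(warehouse_rows, raw_events)
print(merged[0]["name"], len(merged[0]["events"]))  # Acme Pty Ltd 2
```

Even in this tiny example the integration decisions appear in miniature: which key to join on, how to treat events that match no record, and how to cope with fields that are present in some events and absent in others.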

In spite of these problems, big data has the potential to help organisations improve operations and make faster, more intelligent decisions. In addition, the availability of new in-memory technology and high-performance analytics is providing a better way to analyse data more quickly than ever.

At Diversus we have proven examples of implementations for our clients where Big Data has made this real difference.