Understanding Big Data - Volume, Velocity and Variety

Data Sizes and Names

A bit is the smallest unit of information a computer processes: it is either 'on' or 'off', which we write as '1' and '0'. A byte is a sequence of 8 bits processed as a single unit, and bytes are what carry information in its most basic form: characters. An alphanumeric character (e.g. a letter or number such as 'A', 'B' or '7') is stored as 1 byte, so two characters use two bytes (16 bits). For example, the letter 'R' uses 1 byte, which the computer stores as the 8 bits '01010010'.
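To make the 'R' example concrete, here is a minimal Python sketch that converts a character to its 8-bit pattern and back. The helper names `char_to_bits` and `bits_to_char` are made up for illustration, and the sketch assumes a single-byte encoding such as ASCII; characters outside that range need more than one byte in encodings like UTF-8.

```python
def char_to_bits(ch: str) -> str:
    """Return the 8-bit pattern for one single-byte (ASCII/Latin-1) character."""
    code = ord(ch)  # numeric code of the character, e.g. 82 for 'R'
    if code > 255:
        raise ValueError("character does not fit in one byte")
    return format(code, "08b")  # zero-padded binary, 8 digits wide


def bits_to_char(bits: str) -> str:
    """Turn an 8-bit pattern such as '01010010' back into its character."""
    return chr(int(bits, 2))


print(char_to_bits("R"))            # 01010010
print(bits_to_char("01010010"))     # R
print(len("Hi".encode("ascii")))    # 2 (two characters, two bytes)
```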

A kilobyte (KB) is 1024 bytes, a megabyte (MB) is 1024 kilobytes, and so on, as the table below shows.

As of 2012, about 2.5 exabytes of data are created each day, and that number is doubling every 40 months or so. More data cross the internet every second than were stored in the entire internet just 20 years ago. This gives companies an opportunity to work with many petabytes of data in a single data set, and not just from the internet. For instance, it is estimated that Walmart collects more than 2.5 petabytes of data every hour from its customer transactions. A petabyte is one quadrillion bytes, or the equivalent of about 20 million filing cabinets’ worth of text. An exabyte is 1,000 times that amount, or one billion gigabytes.
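The arithmetic behind these figures can be sketched in a few lines of Python. This is only an extrapolation under a loud assumption: that the 40-month doubling period quoted above stays constant, which the source does not guarantee. The names `projected_daily_exabytes` and `walmart_pb_per_day` are invented for this sketch.

```python
DAILY_EB_2012 = 2.5       # exabytes created per day as of 2012 (from the text)
DOUBLING_MONTHS = 40      # quoted doubling period (assumed constant here)


def projected_daily_exabytes(months_after_2012: float) -> float:
    """Extrapolate daily data creation, assuming steady doubling every 40 months."""
    return DAILY_EB_2012 * 2 ** (months_after_2012 / DOUBLING_MONTHS)


# Walmart example from the text: 2.5 petabytes per hour, scaled to one day.
WALMART_PB_PER_HOUR = 2.5
walmart_pb_per_day = WALMART_PB_PER_HOUR * 24

print(projected_daily_exabytes(40))   # 5.0 (one doubling period later)
print(projected_daily_exabytes(80))   # 10.0 (two doubling periods later)
print(walmart_pb_per_day)             # 60.0 petabytes per day
```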

| Data Size | Equal To | Size in Bytes | Approx. Size in Bytes |
|---|---|---|---|
| Kilobyte (KB) | 1024 Bytes | 1,024 | $10^{3}$ |
| Megabyte (MB) | 1024 Kilobytes | 1,048,576 | $10^{6}$ |
| Gigabyte (GB) | 1024 Megabytes | 1,073,741,824 | $10^{9}$ |
| Terabyte (TB) | 1024 Gigabytes | 1,099,511,627,776 | $10^{12}$ |
| Petabyte (PB) | 1024 Terabytes | 1,125,899,906,842,624 | $10^{15}$ |
| Exabyte (EB) | 1024 Petabytes | 1,152,921,504,606,846,976 | $10^{18}$ |
| Zettabyte (ZB) | 1024 Exabytes | 1,180,591,620,717,411,303,424 | $10^{21}$ |
| Yottabyte (YB) | 1024 Zettabytes | 1,208,925,819,614,629,174,706,176 | $10^{24}$ |
| Xenottabyte | 1024 Yottabytes | 1,237,940,039,285,380,274,899,124,224 | $10^{27}$ |
| Shilentnobyte | 1024 Xenottabytes | 1,267,650,600,228,229,401,496,703,205,376 | $10^{30}$ |
| Domegemegrottebyte | 1024 Shilentnobytes | 1,298,074,214,633,706,907,132,624,082,305,024 | $10^{33}$ |
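The sizes in the table follow one rule: each unit is 1024 times the previous one, and the approximate column replaces 1024 with 1000. A short Python sketch can reproduce the exact values and turn a raw byte count into a readable size. The function names and the `UNITS` list are assumptions made for this illustration, and the list stops at the yottabyte.

```python
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def binary_size(power: int) -> int:
    """Exact size in bytes of the unit at the given step (1 = KB, 2 = MB, ...)."""
    return 1024 ** power


def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1024."""
    value = float(num_bytes)
    unit = "bytes"
    for name in UNITS:
        if value < 1024:
            break
        value /= 1024
        unit = name
    return f"{value:.1f} {unit}"


# Reproduce the exact and approximate columns for the standard units.
for power, name in enumerate(UNITS, start=1):
    print(f"{name}: {binary_size(power):,} bytes (approx. 10^{3 * power})")

print(human_readable(1536))        # 1.5 KB
print(human_readable(1024 ** 3))   # 1.0 GB
```

Note that the gap between the exact and approximate values grows with each step, because the small difference between 1024 and 1000 compounds.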

  • 2 Megabytes: A high resolution photograph
  • 20 Gigabytes: A good collection of the works of Beethoven OR 5 Exabyte tapes (the tape cartridge brand, not the unit) OR A VHS tape used for digital data
  • 10 Terabytes: The printed collection of the US Library of Congress
  • 2 Petabytes: All US academic research libraries
  • 5 Exabytes: All words ever spoken by human beings.

Data Volumes

Three V's - Volume, Velocity, and Variety

Scientific reports on big data can seem to borrow terms from an alien language. Consider the volume of data generated when experiments are running at the Large Hadron Collider. This enormous scientific instrument near Geneva, Switzerland, can generate 42 terabytes of data a day.

The National Climatic Data Center in Asheville, N.C., stores more than 6 petabytes of climate data from ships, buoys, weather balloons, radars, satellites and computer models. By 2020, the center expects to have 20 petabytes.

What is Variety?

Variety simply means different kinds of data. This exciting concept within big data gives you the opportunity to gain insight by combining data sets that would not traditionally sit together. Linking your traditional analytical data sets with many different types of information opens up a new world of analytical possibilities.

So what is so exciting about this?

Well, it allows you to collate data sets that don't obviously relate to each other. Data experts can then analyze this collated data to spot patterns or create insights you would previously have had no clue about. Variety, when tackled well in big data, reveals new things in the data your organization already produces.

An example: Jennifer is a brand manager. She loves her job and is very good at it, but she knows she would benefit from listening even more closely to the voice of her customer. From traditional financial information, Jennifer can already see the performance of her brand. It doesn't take a data scientist to see which week did well and which week did badly. But the figures won't tell her why.

Harnessing variety in data, Jennifer's data team can create relations between this data and what's being said on social media about her brand, as well as in text-input fields on customer satisfaction surveys. These disparate sets of data can be brought together, contextualized, and visualized in a way that gives Jennifer clues as to what her brand has done to influence customer behavior.

Suddenly, Jennifer now has the vision to generate hypotheses on ways to amplify positive results and mitigate negative trends. Most importantly, she can take action.
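As a rough illustration of Jennifer's case, the sketch below lines up weekly sales figures with the average sentiment of social media posts from the same week. Everything in it is hypothetical: the week labels, sales amounts, sentiment scores (assumed to range from -1 to 1) and function names are invented for this example, and a real data team would pull these from their own systems.

```python
from collections import defaultdict

# Hypothetical weekly sales figures (the "traditional" financial data).
weekly_sales = {"week-1": 120_000, "week-2": 85_000, "week-3": 131_000}

# Hypothetical social media posts: (week, sentiment score between -1 and 1).
social_posts = [
    ("week-1", 0.6), ("week-1", 0.4),
    ("week-2", -0.7), ("week-2", -0.3),
    ("week-3", 0.8),
]


def average_sentiment_by_week(posts):
    """Average the sentiment scores of all posts that fall in the same week."""
    totals = defaultdict(lambda: [0.0, 0])  # week -> [sum of scores, post count]
    for week, score in posts:
        totals[week][0] += score
        totals[week][1] += 1
    return {week: total / count for week, (total, count) in totals.items()}


def combine(sales, sentiment):
    """Join the two data sets on the week label, one row per sales week."""
    return [
        {"week": week, "sales": amount, "avg_sentiment": sentiment.get(week)}
        for week, amount in sorted(sales.items())
    ]


for row in combine(weekly_sales, average_sentiment_by_week(social_posts)):
    print(row)
```

In this made-up data, the weakest sales week is also the week with negative sentiment, which is the kind of clue that lets Jennifer form a hypothesis about why sales dipped. It does not prove a cause on its own.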

What is Velocity?

Big Data Velocity deals with the pace at which data flows in from sources like business processes, machines, networks, and human interaction with things like social media sites and mobile devices. The flow of data is massive and continuous. If you are able to handle the velocity, this real-time data can help researchers and businesses make valuable decisions that provide strategic competitive advantages and ROI. Inderpal suggests that sampling data can help deal with issues like volume and velocity.
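The text does not say which sampling method is meant. One common choice for a continuous stream of unknown length is reservoir sampling, which keeps a fixed-size random sample while reading each item only once. The sketch below is one possible implementation under that assumption; `reservoir_sample` and its parameters are names chosen for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Pick a slot in [0, index]; replace only if it falls inside the reservoir.
            j = rng.randint(0, index)
            if j < k:
                sample[j] = item
    return sample


# Usage: sample 10 events from a simulated stream of 1,000,000 events.
events = (f"event-{i}" for i in range(1_000_000))
print(reservoir_sample(events, 10))
```

Memory use stays fixed at k items no matter how fast or how long the data flows, which is why this approach suits high-velocity sources.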

Use of Big Data

Big Data has the potential to transform the way you run your organization. When used properly, it will create new insights and more effective ways of doing business, such as:

  • How to design and deliver products
  • How your customers can find and interact with you
  • How you can produce insights in real time
