BIG DATA MANAGEMENT

Introduction

Big data can be described as large volumes of data, both structured and unstructured. The concept of big data has evolved in the current century as most companies strive to deal with such data (WhatIs.com, 2017). In the early 2000s, the analyst Doug Laney described big data in terms of three Vs: volume, velocity, and variety. Regarding volume, organizations collect data from many sources, including social media, business transactions, and machine-to-machine communication, and the sheer amount makes the data hard to store. Regarding velocity, data streams in at unprecedented speed and must be analyzed in a timely manner; technologies such as RFID tags and smart metering drive the need to deal with data in near-real time. Regarding variety, big data comes in all kinds of formats, both structured and unstructured, including audio, video, email, stock ticker data, financial transactions, and traditional text documents (Insights, 2017). This makes big data complex to handle.

The importance of big data lies not in the quantity of data but in the information that can be derived from it. Big data can be used to reduce production costs, develop new products, make smarter decisions, and respond to events in a timely way. To achieve these benefits, however, the data must be analyzed. Analytical techniques are used to uncover hidden patterns, identify customer preferences, understand market trends, and discover unknown correlations. The main techniques for analyzing big data include regression analysis, sentiment analysis, genetic algorithms, association rule learning, social network analysis, machine learning, and classification tree analysis.

Analysis cannot be carried out unless the data is managed. Data management covers the processes and technologies used to acquire and store data so that it can later be analyzed. As with analytics, data management can draw on various technologies, including Hadoop, NoSQL, Hive, Cassandra, MapReduce, cloud services, and parallel databases, to mention just a few (NoSQL Database, 2016). This paper focuses on NoSQL.

What is NoSQL

NoSQL, also read as "Not Only SQL", describes a family of database designs that store data in key-value, column, document, and graph formats. NoSQL is an alternative to relational databases queried with the Structured Query Language (SQL), which was developed in the 1970s and standardized in the 1980s (WhatIs.com, 2017). NoSQL differs from SQL databases, which store data in tables and require a schema to be designed carefully before the database is built (Cattell, 2011). NoSQL targets large, distributed data sets. Unlike SQL systems, NoSQL systems were built not to follow a fixed relational schema. For instance, companies with very large databases, such as Google and Amazon, have used NoSQL to focus on narrow operational goals.

Classification of NoSQL architecture

There are four main classes of NoSQL architecture: key-value stores, document databases, wide column stores, and graph stores.

Key-value stores

The schema-less format of a key-value database such as Riak suits simple storage needs. It pairs each unique key with an associated value in a simple data model. Because the model is so simple, key-value stores are well suited to session management and caching in web applications. The key can be generated automatically, and the value can be a string, a BLOB, JSON, or any other type (3Pillar Global, 2017).

Key-value stores normally use a hash table in which each unique key holds a pointer to a particular item of data. A bucket is a logical group of keys. Buckets do not physically group the data, and identical keys can exist in different buckets. Performance is improved by caching the key mappings. To read a value, one must know both the key and the bucket, because the real key is a hash of the two: Hash(Bucket + Key).
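The bucket-plus-key addressing described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how Riak or Dynamo are implemented; the class name, the method names, and the choice of SHA-256 as the hash are assumptions made for the example.

```python
import hashlib


class BucketedKeyValueStore:
    """Toy in-memory key-value store addressed by (bucket, key)."""

    def __init__(self):
        self._table = {}  # real key (hash) -> value

    @staticmethod
    def _real_key(bucket, key):
        # The real key is a hash of bucket + key. A separator keeps
        # ("ab", "c") and ("a", "bc") from producing the same input.
        raw = f"{bucket}\x00{key}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def put(self, bucket, key, value):
        self._table[self._real_key(bucket, key)] = value

    def get(self, bucket, key, default=None):
        return self._table.get(self._real_key(bucket, key), default)


store = BucketedKeyValueStore()
# The same key can live in two different buckets without clashing.
store.put("sessions", "user42", '{"cart": ["book"]}')
store.put("profiles", "user42", '{"name": "Ada"}')
```

Note that the store can only answer "what is stored under this bucket and key"; it has no way to search by the content of a value.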

The key-value model is simple to implement, but it is not ideal when only part of a value needs to be updated or when the database must be queried by value. This type of database also lacks traditional database capabilities. For instance, it does not provide atomic transactions or consistency when multiple transactions run simultaneously. Furthermore, a key-value store becomes harder to operate as the volume of data increases (Katsov, 2017). Some of the most popular key-value stores are Riak and Amazon’s Dynamo.

Document databases

Document databases, also known as document stores, store and describe semi-structured data in document format. They are popular with programmers because programs can be created and updated without referring to a master schema. Documents are looked up by key, as in key-value stores, but the difference is that the value has a structure and an encoding that the database uses to manage the data (3Pillar Global, 2017). Common encodings include BSON, JSON, and XML.

One example of a document database is Apache CouchDB. It stores data as JSON, uses JavaScript as its query language with MapReduce, and exposes an HTTP API. Data and relationships are stored not in tables but as a collection of independent documents. The advantage of the schema-less structure is that JSON documents with new fields can be added without defining the change first (Katsov, 2017). The most widely used document databases today are MongoDB and Couchbase.
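The schema-less behaviour can be illustrated with a small Python sketch. It does not use the CouchDB or MongoDB APIs; the `insert` and `find` helpers and the sample documents are hypothetical and only show that documents with different fields can share one collection.

```python
import json

collection = []  # an in-memory stand-in for a document collection


def insert(document):
    # Round-tripping through JSON checks that the document is valid
    # JSON data and stores an independent copy of it.
    collection.append(json.loads(json.dumps(document)))


def find(**criteria):
    # Return every document whose top-level fields match all criteria.
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Two documents with different shapes; no schema change is declared.
insert({"_id": 1, "title": "Intro to NoSQL", "tags": ["db"]})
insert({"_id": 2, "title": "Graph stores", "author": {"name": "Ada"}, "year": 2017})
```

A query such as `find(year=2017)` simply skips documents that have no `year` field.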

Wide column stores

Data in this architecture is organized by column rather than by row. Column-oriented storage is used in both NoSQL and SQL systems. Columns are grouped into column families, and each family can contain a virtually unlimited number of columns that are either created at runtime or defined in a schema. One advantage of a wide-column database is that it can query large volumes of data faster than relational techniques.

Most relational DBMSs store data in rows. A columnar database, in contrast, stores data in columns, which makes data access and aggregation faster. A relational database stores each row as a contiguous disk entry, so different rows sit in different places on disk, whereas a columnar database stores all the cells of a column as one contiguous disk entry. For instance, finding the titles in a collection of a million articles in a relational model is tedious, because every row location must be visited to read its title. In a wide-column model, a single disk access retrieves the titles of all the items (3Pillar Global, 2017).
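The difference between the two layouts can be sketched with plain Python lists and dictionaries. The article records and field names below are invented for illustration, and in-memory lists only stand in for disk entries.

```python
# Hypothetical article records; the field names are illustrative only.
row_store = [
    {"id": 1, "title": "Intro to NoSQL", "body": "..."},
    {"id": 2, "title": "Graph stores", "body": "..."},
    {"id": 3, "title": "Wide columns", "body": "..."},
]

# Column-oriented layout: one contiguous list per column.
column_store = {
    name: [row[name] for row in row_store]
    for name in ("id", "title", "body")
}


def titles_from_rows(rows):
    # Row layout: every record must be visited to pick out one field.
    return [row["title"] for row in rows]


def titles_from_columns(columns):
    # Column layout: the whole "title" column is read in one access.
    return columns["title"]
```

Both functions return the same titles; the column layout just avoids touching the `id` and `body` data on the way.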

The column-store model is built from four elements: keyspace, column family, key, and column. A column family is a single structure that groups columns and super columns. A key is the permanent name of a record, and different keys in the same database can have different numbers of columns. A keyspace is the outermost level of organization and is typically named after the application, for example 3PillarDataBase. Lastly, a column is a tuple with a defined name and value, and columns are kept as an ordered list (Katsov, 2017). The most popular column-store databases are Cassandra, Google’s BigTable, and HBase.
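The nesting of keyspace, column family, key, and column can be pictured as dictionaries inside dictionaries. The sketch below is a simplified model under that assumption; the `Keyspace` class and its methods are hypothetical and are not the Cassandra or HBase API, and super columns are left out.

```python
from collections import defaultdict


class Keyspace:
    """Toy model: keyspace -> column family -> row key -> columns."""

    def __init__(self, name):
        self.name = name
        # column family name -> row key -> {column name: value}
        self.column_families = defaultdict(lambda: defaultdict(dict))

    def put(self, family, key, column, value):
        self.column_families[family][key][column] = value

    def get_row(self, family, key):
        # Columns come back as (name, value) tuples ordered by name.
        return sorted(self.column_families[family][key].items())


ks = Keyspace("3PillarDataBase")
# Rows in the same column family may hold different numbers of columns.
ks.put("Articles", "article:1", "title", "Intro to NoSQL")
ks.put("Articles", "article:1", "author", "Ada")
ks.put("Articles", "article:2", "title", "Graph stores")
```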

Graph stores

In this architecture, data is organized as nodes (similar to records in relational databases) and edges (representing the connections between nodes). Instead of tables of rows and columns, a flexible graph representation is used, which helps address scalability concerns and makes it easy to transform data from one model to another. Both nodes and relationships carry defined properties. Graph databases are typically labeled, directed, attributed multigraphs: nodes and edges are labeled, relationships between nodes are shown by directed edges, both can hold attributes, and a pair of nodes can be connected by more than one edge (3Pillar Global, 2017).
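A labeled, directed, attributed multigraph can be sketched with a dictionary of nodes and a list of edges. The `PropertyGraph` class and the customer and flight data are hypothetical; real graph stores such as Neo4j use their own storage formats and query languages.

```python
class PropertyGraph:
    """Toy labeled, directed, attributed multigraph."""

    def __init__(self):
        self.nodes = {}  # node id -> {"label": ..., "props": {...}}
        self.edges = []  # (source, label, target, props); parallel edges allowed

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, source, label, target, **props):
        self.edges.append((source, label, target, props))

    def neighbors(self, source, label=None):
        # Follow outgoing edges only, optionally filtered by edge label.
        return [
            target for src, edge_label, target, _ in self.edges
            if src == source and (label is None or edge_label == label)
        ]


graph = PropertyGraph()
graph.add_node("alice", "Customer", city="Oslo")
graph.add_node("bob", "Customer")
graph.add_node("flight42", "Flight")
# Two parallel edges between the same pair of nodes (a multigraph).
graph.add_edge("alice", "BOOKED", "flight42", seat="12A")
graph.add_edge("alice", "BOOKED", "flight42", seat="12B")
graph.add_edge("alice", "REFERRED", "bob")
```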

Graph stores suit systems that must map relationships, such as customer relationship management and reservation systems. Relational models can replicate a graph, but the joins needed to follow edges are costly and complex (MongoDB, 2017). Examples of graph stores include Titan, AllegroGraph, Neo4j, and IBM Graph.

Basic principles of NoSQL data modeling

Denormalization

Denormalization means copying the same data into multiple tables or documents in order to simplify query processing or to fit the data into a particular model. Most of the techniques discussed in this paper use denormalization in some form. It trades a larger total data volume for two benefits (MongoDB, 2017). First, it reduces the data volume or I/O per query, because all the data a query needs can be grouped in one place. Second, it reduces processing complexity, because data is stored in a query-friendly structure that makes the query simple to execute.
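As a rough illustration, the sketch below copies a customer's name into each order so that an order listing needs no second lookup. The data and the `denormalize` helper are invented for the example.

```python
# Normalized form: orders refer to customers by id only.
customers = {"c1": {"name": "Ada", "city": "London"}}
orders = [
    {"order_id": "o1", "customer_id": "c1", "total": 30},
    {"order_id": "o2", "customer_id": "c1", "total": 12},
]


def denormalize(orders, customers):
    # Copy the customer's name into every order. The name is now stored
    # several times, which is the price paid for cheaper reads.
    return [
        {**order, "customer_name": customers[order["customer_id"]]["name"]}
        for order in orders
    ]


denormalized_orders = denormalize(orders, customers)
```

The cost appears on writes: if the customer's name changes, every copy has to be updated.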

Aggregates

All the main types of NoSQL provide soft schema capabilities in one way or another: key-value stores and graph databases, BigTable-style models, and document databases. A soft schema allows one to form classes of entities with a complex internal structure (nested entities) and to vary the structure of individual entities. This has two benefits. It minimizes one-to-many relationships by means of nested entities and, as a result, reduces joins. It also masks the technical differences between business entities, so that heterogeneous entities can be modeled using one collection of documents or one table (MongoDB, 2017).
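A nested entity can be sketched as an order document that carries its own line items, so the one-to-many relationship between an order and its items needs no join. The field names and the `order_total` helper are hypothetical.

```python
# One aggregate: the order and its line items live in a single document.
order = {
    "order_id": "o1",
    "customer": "Ada",
    "items": [
        {"sku": "book", "qty": 2, "price": 10.0},
        {"sku": "pen", "qty": 3, "price": 1.5},
    ],
}

# A second order with an extra field; the soft schema allows the
# structure of individual entities to vary within one collection.
gift_order = {
    "order_id": "o2",
    "customer": "Grace",
    "items": [{"sku": "mug", "qty": 1, "price": 8.0}],
    "gift_note": "Happy birthday",
}


def order_total(order_document):
    # Everything needed is inside the aggregate; no join is required.
    return sum(item["qty"] * item["price"] for item in order_document["items"])
```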

Application-side joins

NoSQL stores rarely support joins. Because NoSQL data models are designed around the questions an application asks, joins are mostly handled at design time, whereas in relational models they are handled at query execution time. In many cases joins are nevertheless unavoidable and must be handled by the application. They are mainly caused by many-to-many relationships, which are usually modeled with links between entities (NoSQL Database, 2016).
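When the store cannot join, the application follows the links itself. The sketch below models a many-to-many relationship between users and groups with a list of member ids and resolves it in application code; all names and data are invented for the example.

```python
# Two separate "collections"; groups link to users by id.
users = {"u1": {"name": "Ada"}, "u2": {"name": "Grace"}}
groups = {
    "g1": {"name": "Admins", "member_ids": ["u1", "u2"]},
    "g2": {"name": "Editors", "member_ids": ["u2"]},
}


def members_of(group_id):
    # Join performed by the application: one lookup per linked user.
    return [users[user_id]["name"] for user_id in groups[group_id]["member_ids"]]


def groups_of(user_id):
    # The reverse direction has to scan the groups, since the link is
    # stored on the group side only.
    return sorted(
        group["name"] for group in groups.values()
        if user_id in group["member_ids"]
    )
```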

Atomic aggregates

Many NoSQL solutions offer only limited transaction support. It is therefore important to model data as aggregates so that some of the ACID properties can be preserved. Relational models need transactions because normalization means that one logical change updates data in several places. Aggregates, on the other hand, allow the data of a business entity to be stored as a single document, key-value pair, or row and to be updated atomically. Atomic aggregates are not a complete transactional solution, but they are applicable when the store guarantees atomicity, locks, or test-and-set instructions (Sadalage & Fowler, 2013).
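A test-and-set style update of a single aggregate can be sketched with a version number that must match before a write is accepted. The `AggregateStore` class is a hypothetical in-memory stand-in, and the lock only simulates the atomicity that a real store would have to provide.

```python
import copy
import threading


class AggregateStore:
    """Toy store that replaces a whole aggregate atomically."""

    def __init__(self):
        self._data = {}  # key -> (version, aggregate)
        self._lock = threading.Lock()

    def read(self, key):
        with self._lock:
            version, aggregate = self._data.get(key, (0, None))
            return version, copy.deepcopy(aggregate)

    def compare_and_set(self, key, expected_version, new_aggregate):
        # The write succeeds only if nobody changed the aggregate since
        # it was read; otherwise the caller must re-read and retry.
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False
            self._data[key] = (current_version + 1, copy.deepcopy(new_aggregate))
            return True


store = AggregateStore()
created = store.compare_and_set("order:1", 0, {"status": "new", "items": ["book"]})

version, order = store.read("order:1")
order["status"] = "paid"
updated = store.compare_and_set("order:1", version, order)

# A second writer still holding the old version is rejected.
stale = store.compare_and_set("order:1", version, {"status": "cancelled"})
```

Because the whole order is one aggregate, a single successful write changes it completely; there is no moment at which only part of the order has been updated.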

Enumerable keys

One important benefit of unordered key-value data is that it can be partitioned across servers simply by hashing the key. Sorting makes things more complex, but enumerable (ordered) keys can still simplify retrieval and storage (Sadalage & Fowler, 2013). For instance, messages can be stored under a composite key of the form userID_messageID, so that a user's messages can be traversed forward or backward from a known message ID. Messages can also be grouped into buckets, for example one bucket per day, which allows one to traverse emails or messages forward or backward from the current date.
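The userID_messageID scheme can be sketched with a sorted list of keys and a range scan. The zero-padded key format, the helper names, and the sample messages are assumptions for the example; the padding is what makes text order agree with numeric order.

```python
import bisect

sorted_keys = []  # keys kept in sorted order, as an ordered store would
values = {}       # key -> message body


def message_key(user_id, message_id):
    # Zero-padding keeps lexicographic order equal to numeric order.
    return f"{user_id}_{message_id:08d}"


def put(user_id, message_id, body):
    key = message_key(user_id, message_id)
    if key not in values:
        bisect.insort(sorted_keys, key)
    values[key] = body


def messages_for(user_id, start_id, end_id):
    # Range scan over the keys from start_id to end_id, inclusive.
    low = bisect.bisect_left(sorted_keys, message_key(user_id, start_id))
    high = bisect.bisect_right(sorted_keys, message_key(user_id, end_id))
    return [values[key] for key in sorted_keys[low:high]]


put("alice", 1, "first")
put("alice", 3, "third")
put("alice", 2, "second")
put("bob", 2, "hi")
```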

Dimensionality reduction

This technique maps multidimensional data to a key-value model or to another non-multidimensional model. A good example is Geohash, which encodes a geographic location as a short code so that the proximity of two regions can be estimated from the proximity of their bit-wise codes. Dimensionality reduction makes it possible to store and query such data in models that do not support multiple dimensions.
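The Geohash idea can be sketched as follows: longitude and latitude are halved in turn, each step contributes one bit, and every five bits become one base-32 character. This is a minimal encoder written for illustration; the function name and the default precision are choices made here.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet


def geohash_encode(latitude, longitude, precision=8):
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_longitude = True  # bits alternate, starting with longitude

    while len(chars) < precision:
        if use_longitude:
            interval, value = lon_range, longitude
        else:
            interval, value = lat_range, latitude
        middle = (interval[0] + interval[1]) / 2
        if value >= middle:
            bits = (bits << 1) | 1
            interval[0] = middle  # keep the upper half
        else:
            bits = bits << 1
            interval[1] = middle  # keep the lower half
        use_longitude = not use_longitude
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)
```

Two points that share a code prefix lie in the same cell and are therefore close together, so a two-dimensional proximity query becomes a one-dimensional prefix lookup on the key. The reverse does not always hold: two nearby points on opposite sides of a cell boundary can have different prefixes.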

Nested sets

This technique models tree-like structures. It is applicable to key-value stores and document databases. The idea is to store the leaves of the tree in an array and to map each non-leaf node to a range of leaves using start and end indexes (MongoDB, 2017). Other data modeling techniques used in NoSQL include nested document flattening, batch graph processing, index tables, composite key indexes, aggregation with composite keys, inverted search, tree aggregation, adjacency lists, and materialized paths.
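Returning to nested sets, the leaf array and the index ranges can be built with one depth-first walk of the tree. The sketch below assumes the tree is given as a dictionary from each non-leaf node to its children; the catalogue data and function names are hypothetical.

```python
def build_nested_sets(tree, root):
    """Return (leaves, ranges): the leaf array and, for each non-leaf
    node, the inclusive (start, end) indexes of the leaves below it."""
    leaves = []
    ranges = {}

    def visit(node):
        children = tree.get(node, [])
        if not children:
            leaves.append(node)  # a leaf goes into the array
            return
        start = len(leaves)
        for child in children:
            visit(child)
        ranges[node] = (start, len(leaves) - 1)

    visit(root)
    return leaves, ranges


def leaves_under(node, leaves, ranges):
    # A non-leaf node maps to a slice of the array; a leaf maps to itself.
    if node not in ranges:
        return [node]
    start, end = ranges[node]
    return leaves[start:end + 1]


# Hypothetical catalogue: non-leaf node -> list of children.
catalog = {
    "Catalog": ["Books", "Music"],
    "Books": ["Novel", "Atlas"],
    "Music": ["Album"],
}
leaves, ranges = build_nested_sets(catalog, "Catalog")
```

Reading a whole subtree then costs one slice of the array, while inserting a new leaf may shift the indexes of many nodes, which is why the structure suits data that is read far more often than it is changed.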

Benefits of NoSQL

NoSQL databases are more scalable and offer better performance than relational databases. They also address problems that the relational model was not designed to address, such as large volumes of rapidly changing structured, semi-structured, and unstructured data. Furthermore, NoSQL fits development practices built around agile sprints, quick schema iteration, and frequent code pushes, and it works well with object-oriented programming. Lastly, NoSQL supports geographically distributed scale-out architectures, which are difficult to achieve with relational techniques (Omidi & Alipour, 2016).

Conclusion

In the current century, companies face the challenges of big data. Big data is a term used to describe data that is large in volume, arrives at high velocity, and comes in a wide variety of formats. To make sense of big data, companies need to manage and analyze it. Big data management is the practice of acquiring and storing data for analysis. Various technologies and techniques are used in data management, and one of them is NoSQL.

NoSQL is an alternative to relational databases based on the Structured Query Language (SQL), which was standardized in the 1980s. NoSQL has four major architectures for storing and handling data: key-value stores, document stores, wide column stores, and graph stores. A key-value store uses large hash tables of keys and values. A document store keeps data in documents made up of tagged elements. A wide column store groups data into column families and stores it by column instead of by row. Lastly, a graph store uses nodes and edges to represent data. NoSQL data modeling relies on basic principles such as denormalization, aggregates, application-side joins, enumerable keys, and nested sets, to mention just a few. The advantage of NoSQL techniques is that they are fast when dealing with structured, semi-structured, and unstructured data that changes rapidly.

REFERENCES

Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), 12. http://dx.doi.org/10.1145/1978915.1978919

Exploring the Different Types of NoSQL Databases Part ii. (2017). 3Pillar Global. Retrieved 18 May 2017, from https://www.3pillarglobal.com/insights/exploring-the-different-types-of-nosql-databases

Insights, S. (2017). What is Big Data and why it matters. Sas.com. Retrieved 18 May 2017, from https://www.sas.com/en_us/insights/big-data/what-is-big-data.html

Katsov, I. (2017). NoSQL Data Modeling Techniques. Highly Scalable Blog. Retrieved 18 May 2017, from https://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/

NoSQL Database: Cassandra is a Better Option to Handle Big Data. (2016). International Journal Of Science And Research (IJSR), 5(1), 24-26. http://dx.doi.org/10.21275/v5i1.nov152557

NoSQL Databases Explained. (2017). MongoDB. Retrieved 18 May 2017, from https://www.mongodb.com/nosql-explained

Omidi, M., & Alipour, M. (2016). Why NOSQL And The Necessity of Movement Toward The NOSQL Data Base. IOSR Journal Of Computer Engineering, 18(05), 116-118. http://dx.doi.org/10.9790/0661-180502116118

Sadalage, P. J., & Fowler, M. (2013). NoSQL distilled: a brief guide to the emerging world of polyglot persistence. Upper Saddle River, NJ: Addison-Wesley.

What is big data analytics? — Definition from WhatIs.com. (2017). SearchBusinessAnalytics. Retrieved 18 May 2017, from http://searchbusinessanalytics.techtarget.com/definition/big-data-analytics

What is NoSQL (Not Only SQL database)? — Definition from WhatIs.com. (2017). SearchDataManagement. Retrieved 18 May 2017, from http://searchdatamanagement.techtarget.com/definition/NoSQL-Not-Only-SQL