To begin a discussion of Big Data, it is worthwhile to first define the term. One definition describes Big Data as “technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infrastructure to address efficiently”. This is a good definition because it enumerates the key issues in dealing with Big Data, namely data that exhibits the following characteristics: it is fast-changing and/or massive, it does not fit into conventional data storage systems (such as relational database systems), and it is generated and captured rapidly. While I believe this definition accurately reflects how Big Data is perceived in the scientific and business communities, it is worth noting that Big Data as a discipline is still in its infancy and thus open to different interpretations. A brief study of the etymology of the term “Big Data” provides great insight into this (see Diebold). To further our understanding of Big Data, we will look at each of these main characteristics in turn and discuss some of the primary issues they introduce.
Fast-Changing and/or Massive
The sheer volume of data being generated and recorded has reached levels never before seen. Two primary factors drive this trend: the growing interconnection of devices that generate data (commonly referred to as the Internet of Things (IoT)) and the availability of relatively low-cost storage devices, namely hard disk drives (HDDs). The sources that generate data are too numerous to list in full, but a limited list would include devices used for scientific research, mobile communication devices, software applications, and financial transactions. With the proliferation of such sources, it is estimated that the rate at which data is created increases tenfold every five years. More significant still, unstructured data accounts for nearly all of the data classified under the Big Data problem. To put the amount of data being generated in perspective, Facebook is estimated to consume 500 terabytes (TB) of data daily, and a Boeing 737 will produce nearly 240 TB of data on a single transcontinental flight. Our ability to record and store this amount of data is facilitated by the availability of high-capacity HDDs. It is estimated that 1 billion HDDs will be shipped annually by 2016.
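To make these estimates easier to compare, the figures above can be converted into per-year and per-second rates. The following is a minimal Python sketch, not part of the cited estimates: the helper name `annual_growth_factor` is hypothetical, the calculation assumes smooth compound growth, and the throughput figure assumes decimal units (1 TB = 1000 GB).

```python
def annual_growth_factor(multiple: float, years: float) -> float:
    """Return the constant yearly factor that compounds to `multiple` over `years`."""
    return multiple ** (1.0 / years)


# "Tenfold every five years" expressed as a yearly factor (assumes steady growth).
factor = annual_growth_factor(10, 5)
print(f"annual growth factor: {factor:.3f}")  # about 1.585, i.e. roughly 58% per year

# 500 TB per day expressed as an average rate in GB per second (decimal units).
SECONDS_PER_DAY = 24 * 60 * 60
gb_per_second = 500 * 1000 / SECONDS_PER_DAY
print(f"average ingest rate: {gb_per_second:.1f} GB/s")  # about 5.8 GB/s
```

Seen this way, a tenfold increase over five years corresponds to data creation growing by a little more than half again each year.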
Aside from the challenges of recording and storing this amount of data, managing it further compounds the problems that Big Data technologies are evolving to solve. This is addressed under the second characteristic: that this data does not fit into conventional data storage systems.
Big Data Storage Systems
Traditional data storage systems, such as a relational database management system (RDBMS), become too costly and inefficient when managing large volumes of data. The Big Data response is a new breed of management tools and systems, often grouped under the label NoSQL, that focus on large unstructured (and structured) data sets. Of these tools, several have emerged as industry leaders. They include, but are not limited to, Google Bigtable, SimpleDB, MongoDB, MemcacheDB and Voldemort. Additionally, the systems put in place to manage Big Data generally focus on one of two areas: operational or analytical [Hadoop]. Operational technology focuses on real-time, or near real-time, requests of the data [Hadoop]. Analytical technology tends to serve “high throughput; queries can be very complex and touch most if not all of the data in the system at any time”. So, to further compound the issue of management, one must choose the right technology after determining one's Big Data needs. This brings us to our last section, the generation of Big Data.
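The difference between operational and analytical access can be illustrated with a toy example. The sketch below is my own illustration rather than the behaviour of any particular product: the in-memory `orders` dictionary, its field names, and both function names are hypothetical stand-ins for a real data store.

```python
from collections import defaultdict

# Hypothetical in-memory "store": order id -> record (field names are illustrative).
orders = {
    1: {"customer": "alice", "total": 30.0},
    2: {"customer": "bob", "total": 12.5},
    3: {"customer": "alice", "total": 7.5},
}


def operational_lookup(order_id):
    """Operational-style request: fetch one record by key, touching a single entry."""
    return orders.get(order_id)


def analytical_totals():
    """Analytical-style query: scan every record and aggregate spend per customer."""
    totals = defaultdict(float)
    for record in orders.values():
        totals[record["customer"]] += record["total"]
    return dict(totals)


print(operational_lookup(2))
print(analytical_totals())
```

The lookup touches one record no matter how large the store grows, while the aggregation must read all of it, which is why the two kinds of workload tend to favour different technologies.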
Generation of Big Data
As previously discussed, we have an unprecedented ability to generate and record information. I touched on unstructured and, conversely, structured data earlier in this paper. That distinction bears not only on the amount of data but also on the type of data being generated and stored. Big Data deals not only with the massive amount of data being generated, but also with the fact that this data is increasingly diverse. This poses a significant problem because traditional systems were designed to handle smaller, well-structured data sets. To compound this issue further, these traditional systems were not designed to scale, and it quickly becomes cost-prohibitive to use them for Big Data operations and analytics. Again, Big Data systems are specifically designed to deal with these problems.
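The contrast between well-structured and diverse data can be sketched in a few lines. This is an illustrative example only, under my own assumptions: the `readings` table, the sensor fields, and the helper `insert_with_extra_field` are hypothetical, and Python's built-in `sqlite3` module stands in for a traditional relational system.

```python
import json
import sqlite3

# Structured data: a relational table whose schema is fixed in advance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id TEXT, temperature REAL)")
conn.execute("INSERT INTO readings VALUES (?, ?)", ("s1", 21.5))


def insert_with_extra_field(connection):
    """Try to store a field the schema does not define; return whether it worked."""
    try:
        connection.execute(
            "INSERT INTO readings (sensor_id, temperature, humidity) VALUES (?, ?, ?)",
            ("s2", 19.0, 0.4),
        )
        return True
    except sqlite3.OperationalError:
        # The table has no "humidity" column, so the schema would have to change first.
        return False


# Less structured data: self-describing documents whose fields vary per record.
documents = [
    json.loads('{"sensor_id": "s1", "temperature": 21.5}'),
    json.loads('{"sensor_id": "s2", "temperature": 19.0, "humidity": 0.4}'),
    json.loads('{"sensor_id": "s3", "note": "offline for maintenance"}'),
]

print(insert_with_extra_field(conn))
print(sorted({key for doc in documents for key in doc}))
```

The relational table rejects the record with an unexpected field, whereas the list of documents absorbs records of differing shape without any schema change.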
Conclusion and Security Concerns
As we have seen, Big Data encompasses information that does not fit into traditional data systems, whether for storage, processing or retrieval. It raises a variety of issues, the primary ones being those discussed here: variety, velocity and volume [Hadoop]. Up to this point, our discussion has focused on the characteristics and issues surrounding Big Data. One topic that has been conspicuously absent is security. Security for these large, often distributed systems is complex and cannot be approached as it is with traditional systems. The security issues are numerous and include, but are not limited to, input/generation validation, privacy in data mining and analytics, access controls and secure communications, and granular access control and auditing.
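As one small illustration of input validation, consider a document-style query filter built from untrusted JSON. The sketch below is a simplified example under my own assumptions, not a complete defence: the function `build_user_filter` is hypothetical, no database driver is called, and the `$ne` operator is used only as an example of an operator-style key in the manner of document databases such as MongoDB.

```python
import json


def build_user_filter(username):
    """Build a query filter from untrusted input, accepting only plain strings."""
    if not isinstance(username, str):
        # A dict such as {"$ne": None} would act as a query operator, not a value.
        raise ValueError("username must be a plain string")
    return {"username": username}


# A well-formed request body yields an ordinary equality filter.
good_body = json.loads('{"username": "alice"}')
print(build_user_filter(good_body["username"]))

# A malicious body smuggles in an operator object in place of a string.
bad_body = json.loads('{"username": {"$ne": null}}')
try:
    build_user_filter(bad_body["username"])
except ValueError as error:
    print(f"rejected: {error}")
```

Checking the type of each value before it reaches the query is only one layer of protection; the other concerns listed above, such as access control and auditing, still apply.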
References
Big Data Explained, MongoDB, [online] 2014, http://www.mongodb.com/big-data-explained (Accessed: 26 October 2014)
Diebold, Francis X. “A Personal Perspective on the Origin(s) and Development of ‘Big Data’: The Phenomenon, the Term, and the Discipline”, 26 November 2012, http://www.ssc.upenn.edu/~fdiebold/papers/paper112/Diebold_Big_Data.pdf (Accessed: 26 October 2014)
Khan, Nawsher, et al. Big Data, Technologies, Opportunities, and Challenges. The Scientific World Journal. 17 July 2014
Top Ten Big Data Security and Privacy Challenges, Cloud Security Alliance, [online] November 2012, http://www.isaca.org/Groups/Professional-English/big-data/GroupDocuments/Big_Data_Top_Ten_v1.pdf (Accessed: 26 October 2016)
Testing for NoSQL Injection, Open Web Application Security Project, [online] 2014, https://www.owasp.org/index.php/Testing_for_NoSQL_injection (Accessed: 26 October 2014)