By Isac Artzi, PhD
Faculty, College of Science, Engineering and Technology
What is data and how big does it have to be to qualify as “big data”?
PayPal now processes more than 12 million transactions per day. Over that same day, more than one million items change hands at Walmart. These are just two examples of really big data. Extend the calculation over 30 days, and PayPal must record, manage, authenticate, store and execute 360 million transactions each month. Similarly, Walmart accepts, records, processes and refunds 30 million items each month.
These are definitely examples of data that are REALLY big.
How is Big Data Collected?
We all know intuitively that there must be computer technology and storage mechanisms that handle massive amounts of information, but how does it actually happen? All data collected by Walmart points-of-sale (POS) used to be directed to an Oracle database. In 2012, Walmart started to migrate its Oracle database to a 250-node Hadoop cluster. A Hadoop cluster is a collection of interconnected computing devices, designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. Each node in the cluster provides the storage and processing capability for its own slice of the data. The nodes cooperate to form a parallel file system, called the Hadoop Distributed File System (HDFS). The programming model used to process massive amounts of data on HDFS is MapReduce. Both trace back to papers Google published about its internal systems: MapReduce itself, and the Google File System (GFS), on which HDFS is modeled.
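The MapReduce idea is simple enough to sketch on a single machine. The following Python snippet (a toy illustration with made-up sales records, not Hadoop code) shows the three conceptual phases: map each record to a key-value pair, shuffle the pairs into groups by key, and reduce each group to a total. On a real cluster, each phase runs in parallel across many nodes.

```python
from collections import defaultdict

# Hypothetical sample of POS records: (store_id, item) pairs.
sales = [
    ("store_1", "milk"), ("store_2", "bread"),
    ("store_1", "milk"), ("store_3", "eggs"),
]

def map_phase(records):
    # Map step: emit a (key, 1) pair for every item sold.
    return [(item, 1) for _, item in records]

def shuffle(pairs):
    # Shuffle step: group all values by key, as the framework
    # does when routing data between cluster nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce step: aggregate each key's values into a total.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(sales)))
print(counts)  # {'milk': 2, 'bread': 1, 'eggs': 1}
```

The same word-count-style pattern scales from four records to billions; only the number of machines changes, not the programming model.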
Companies like 42 Technologies provide turnkey solutions that collect data generated by each POS, regardless of device, and send a data stream to the backend database repository (e.g. a Hadoop cluster). There, the data is sorted, classified and mined. The result is an ongoing report of various metrics of interest to an organization, such as popularity of certain items or categories, sales activity throughout the day and trends over multiple periods of time.
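The "sorted, classified and mined" step can be pictured with a short Python sketch. The event records below are invented for illustration (they are not 42 Technologies' actual data format), but they show how a stream of POS events becomes the two metrics the paragraph mentions: item popularity and sales activity by time of day.

```python
from collections import Counter

# Hypothetical POS event stream: (hour_of_day, item, price) tuples,
# the kind of feed a backend repository might receive from stores.
events = [
    (9, "coffee", 3.50), (9, "bagel", 2.00),
    (12, "coffee", 3.50), (12, "salad", 7.25),
    (17, "coffee", 3.50),
]

# Popularity: how many times each item sold across the day.
popularity = Counter(item for _, item, _ in events)

# Sales activity: revenue accumulated per hour of day.
revenue_by_hour = Counter()
for hour, _, price in events:
    revenue_by_hour[hour] += price

print(popularity.most_common(1))  # [('coffee', 3)]
print(revenue_by_hour[12])        # 10.75
```

Run over months of data, the same aggregations yield the trend reports described above.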
How is Big Data Processed and Stored?
The challenge of big data processing starts with the simple act of collection. Enormous streams of financial and inventory data generated by companies like PayPal and Walmart require dedicated, secure network connections and reliable, secure storage. The data is so critical that it cannot be kept in just one place; it must be distributed across multiple locations, with backup copies of each, just in case something goes wrong.
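The "distributed over multiple locations" idea can be sketched in a few lines of Python. This is a simplified illustration, not HDFS's actual placement policy: each block of data is hashed to pick a starting node, and copies are placed on the next nodes around a ring, so every block survives the loss of any single machine. (HDFS's real default is also three replicas per block, but its placement rules additionally account for racks and network topology.)

```python
import hashlib

# Hypothetical cluster node names; three replicas per block,
# mirroring HDFS's default replication factor.
NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
REPLICATION_FACTOR = 3

def place_replicas(block_id, nodes=NODES, replicas=REPLICATION_FACTOR):
    # Hash the block id to choose a deterministic starting node,
    # then place the remaining copies on the following nodes.
    digest = hashlib.sha256(block_id.encode()).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

placement = place_replicas("transactions-block-0042")
print(placement)  # three distinct nodes, always the same for this block id
```

Because the placement is deterministic, any machine in the cluster can compute where a block's copies live without consulting a central directory for every read.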
Who Works with Big Data?
It is essential to understand that big data is a concept, not a technology. It encompasses many disciplines and depends on the collaboration of professionals with complementary expertise: IT professionals, computer scientists, mathematicians, statisticians, software engineers, database administrators, systems engineers and cybersecurity experts, just to name a few. Companies like AtScale represent a category of solution providers that transform big data into reports and insights, collectively referred to as business intelligence.
Given the multidisciplinary nature of big data, the concept evolves continuously. Faster computers and memory mean more data can be collected and stored. Faster processors lead to more data being processed faster. More efficient algorithms mean new ways to analyze and mine data – and produce meaningful reports.
From a computer science perspective, the most interesting aspects of big data are the development of algorithms and of software applications in languages like R, Scala or Python.
From an IT perspective, big data is a fascinating nexus in which complex system architecture meets distributed databases, requiring constant tuning for performance.
Big data theoreticians focus on certain areas of statistics and mathematics to build conceptual models for how big data can be collected, analyzed and visualized. Big data professionals are rarely experts in more than one of these areas, such as IT, computer science, statistics or database design; they typically focus on a single specialty.
Why is This Area Growing?
The opportunities presented by the continuous developments in this field ensure that it will remain exciting for many years to come. The pace of data creation only increases, as every single device connected to the Internet generates some data. This also ensures that in 10 years big data will be very different than it is today, just as today is very different from 2001, when analyst Doug Laney articulated the "three Vs" of big data (volume, velocity and variety) that still frame the concept.
Fortunately, all the technologies and software development tools required to learn and explore big data are free or have community versions for non-commercial applications. They include the Apache Spark server, RStudio, Scala programming tools, Python IDEs and more. Anyone can download and install these tools on a personal computer and start experimenting. Many cloud providers, like Amazon Web Services, SAP HANA or Google Cloud, provide free or inexpensive means to set up exploratory cloud-based big data servers. Companies like Tableau and IBM (Watson) provide free software for business analytics that can convert information mined on big data systems into actionable business intelligence.
In summary, while complex, large and ever-changing, big data is a concept that affects everyone all the time, in private and professional lives. We are all continuous, diligent contributors to the collection, analysis and dissemination of big data – each time we shop, post a Facebook message, watch TV, search the Web or stop at a red light.
There seems to be plenty of reason to encourage everyone to learn a little about big data…
Grand Canyon University offers computer science and information technology programs that help prepare students for in-demand careers. To learn more, visit our website or contact us today using the Request More Information button at the top of the page.
The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy or position of Grand Canyon University. Any sources cited were accurate as of the publish date.