From: Gregory Piatetsky-Shapiro
13, 2012 I attended a Big Data Cluster Roundtable: Big Data in Motion,
Mass Technology Leadership Council, @MassTLC.
Here are some of my observations and notes from the meeting.
MassTLC's snazzy new logo conveyed
speed and digital know-how. Sara Fraim (@sarafraim) introduced
Big Data as one of MassTLC's 9 clusters; the others are Cloud, Digital Games,
Energy, Healthcare, Mobile, Robotics, Sales & Marketing, and Software
Development. One major MassTLC goal is to create synergy by bringing together
people from different communities and clusters.
About 25 people assembled
to look at Big Data in Motion, which can be described as Big Data in (almost)
real-time, or structuring Big Data streams, or combining Big Data with Complex
Event Processing. ‘Real-time’ for Big Data can be milliseconds or minutes,
depending on the application.
Eric Schnadig said that Tervela
focuses on very high-speed connectivity, and that its technology is actively
used on Wall Street for high-performance trading. He observed that 2012 is the
peak of the Big Data hype cycle, and predicted that in 2013 the conversation
will shift to more defined segments. The focus of Tervela is Big Data in
Motion, or dealing with Big Data I/O: how do you move it?
The main issues in moving Big
Data are the same as with regular data: bandwidth, security, fault-tolerance.
An audience member remarked that, because of the size of Big Data, it is easier
to move compute to the data than to move the data to compute.
This concept has also been
described as Data Gravity: bigger
data is harder to move, and has a stronger pull on applications.
However, Big Data
in Motion also deals with capturing data in
real-time, reacting to it properly, and, if needed, creating parallel streams:
for example, one to a traditional DW, another to backup for compliance
purposes, and another to a compute engine for decisions.
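The parallel-streams idea can be illustrated with a minimal sketch. This is not how Tervela or any product discussed at the meeting implements it; the function `fan_out`, the three sinks, and the `amount > 1000` decision rule are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Event = Dict[str, int]
Sink = Callable[[Event], None]


def fan_out(events: Iterable[Event], sinks: Sequence[Sink]) -> int:
    """Send every incoming event to every sink; return the number of events seen."""
    count = 0
    for event in events:
        for sink in sinks:
            # Each sink gets its own copy so one consumer cannot mutate another's data.
            sink(dict(event))
        count += 1
    return count


# Hypothetical destinations, modeled as in-memory lists for the sketch.
warehouse: List[Event] = []          # stands in for a traditional DW
compliance_backup: List[Event] = []  # stands in for a compliance archive
decisions: List[str] = []            # output of a toy decision engine


def to_warehouse(event: Event) -> None:
    warehouse.append(event)


def to_backup(event: Event) -> None:
    compliance_backup.append(event)


def to_decision_engine(event: Event) -> None:
    # Assumed rule: flag large amounts for review.
    if event["amount"] > 1000:
        decisions.append(f"review:{event['id']}")


stream = [{"id": 1, "amount": 250}, {"id": 2, "amount": 5000}]
processed = fan_out(stream, [to_warehouse, to_backup, to_decision_engine])
```

In a real deployment each sink would be a network destination, and the fan-out would have to deal with the bandwidth, security, and fault-tolerance issues mentioned above.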
It was also observed that analytics
is not the only application for Big Data. Sometimes communication is a better
use case. For example, incident information can be visualized in real-time on
Eric Alterman talked about the
importance of creating context. Once you have a customer complaint, you need
to retrieve previous complaints and route the case to the appropriate
department for action.
Handling of Big Data is frequently
embarrassingly parallel.
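"Embarrassingly parallel" means the data can be split into partitions that are processed with no communication between them, and the partial results are simply combined at the end. A minimal sketch, assuming a toy log-counting task: the function names and sample log lines are made up for illustration, and threads are used only to keep the example short (CPU-bound work would normally use processes or a cluster).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def count_errors(partition: List[str]) -> int:
    """Process one partition independently of all others."""
    return sum(1 for line in partition if "ERROR" in line)


def parallel_error_count(partitions: List[List[str]], max_workers: int = 4) -> int:
    """Map the same function over every partition, then combine the partial counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(count_errors, partitions))


# Hypothetical log data, already split into three partitions.
partitions = [
    ["INFO start", "ERROR disk full"],
    ["ERROR timeout", "ERROR retry failed", "INFO done"],
    ["INFO idle"],
]
total_errors = parallel_error_count(partitions)
```

Because no partition depends on another, adding workers or machines scales the job with little coordination cost.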
There are many use cases for
processing data while it is in flow, with storage as only the end-point.
There are many use cases for
dealing with Big Data in real-time, but information needs to be indexed first
for fast access.
For healthcare IT, the data
lifecycle is 60-90 days, a glacial pace.
Splunk indexes information as it
arrives. Indexing technology like B-trees is 40 years old, but there are more
modern methods such as LSM trees (Log-Structured Merge trees).
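For readers unfamiliar with LSM trees, the core idea is that writes go to a small in-memory table, which is periodically flushed as an immutable sorted segment; reads check the newest data first, and segments are merged in the background. The sketch below is a toy illustration of that idea only. The class `TinyLSM` and its parameters are hypothetical; it does not describe Splunk's indexing, and real implementations add write-ahead logs, on-disk files, and Bloom filters.

```python
import bisect


class TinyLSM:
    """Toy log-structured merge index kept entirely in memory."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # recent writes, unsorted
        self.segments = []   # immutable sorted (key, value) lists, oldest first

    def put(self, key, value):
        # Writes only touch the in-memory table, which keeps them cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable as a new sorted segment.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then segments newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self):
        # Merge all segments into one, letting newer values overwrite older ones.
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable is full, so it is flushed as segment 1
db.put("a", 3)
db.put("c", 4)   # flushed as segment 2; "a" now exists in both segments
db.put("d", 5)   # stays in the memtable
```

The trade-off against a B-tree is that writes are sequential and cheap, while a read may have to consult several segments until compaction merges them.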
An important use case of Big Data
is one where the receiver of information is a machine: the end user may be
an app rather than a human.
Companies will be more successful
if they can have mid-level programmers deal with Big Data, not data scientists
(lots of work on such tools now).
Ad re-targeting – a good use case
for Big Data in ‘real-time’.
The old centralized Data
Warehouse, built to reduce duplication, does not make sense today when storage
is so cheap. There is a move away from one centralized DW to many smaller DWs.
Another use case (per NYT): large
e-tailers are monitoring and changing prices in real-time.
I asked a question:
Some responses were
analytics – there are too many platforms, and it is too expensive to move
data. One possible exception is Amazon Redshift,
which is a fast, petabyte-scale data warehouse service in the cloud. Cloud
analytics companies can succeed if they are on AWS platform
(obtained from Big Data Analytics) and quality of analytics is also
over-hyped. See also my HBR blog post
infancy – there is still big opportunity.
Underhyped topic: how to store Big Data for a
very long time, so that it will survive frequent changes of format.
Overall, an interesting meeting
and a good discussion!
Some of the tweets during the meeting:
- Lawrence @schwartzlaws: Indexing
is a bottleneck for Big Data in motion. Need to look at alternatives to
traditional B-Trees – at #BigData Roundtable @MassTLC
- Sara Fraim @sarafraim: Not
about analyzing each piece but using all data from mult sources at once
and streaming to right place @flow @masstlc
- Sara Fraim @sarafraim: the
next ‘gem’ in big data is moving the data from place to place, processing,
and rerouting rapidly @masstlc #bigdata