IBM InfoSphere Streams and Stream Computing

What is Stream Computing?

Stream computing gets a lot of attention these days, but what is it? Wikipedia defines it as follows: “Stream processing is a computer programming paradigm, related to SIMD (single instruction, multiple data), that allows some applications to more easily exploit a limited form of parallel processing. In computing, the term stream is used in a number of ways, in all cases referring to a sequence of data elements made available over time. A stream can be thought of as a conveyor belt that allows items to be processed one at a time rather than in large batches.”
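The conveyor-belt idea can be sketched in a few lines of Python. This is only an illustration of processing items one at a time and is not InfoSphere Streams code; the names `sensor_readings` and `running_average` and the sample values are hypothetical.

```python
from typing import Iterable, Iterator


def sensor_readings() -> Iterator[float]:
    """Hypothetical source: yields readings one at a time, as they 'arrive'."""
    for value in [20.0, 22.0, 21.0, 25.0]:  # made-up sample data
        yield value


def running_average(stream: Iterable[float]) -> Iterator[float]:
    """Emit an updated average after every item, without storing the stream."""
    count = 0
    total = 0.0
    for item in stream:
        count += 1
        total += item
        yield total / count


# Each result is available as soon as its input item has been seen.
averages = list(running_average(sensor_readings()))
print(averages)
```

A batch program would wait for all readings before computing one average. Here only two numbers (`count` and `total`) are kept in memory, so the same loop works on a stream that never ends.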

Why Stream Computing?

The need for faster, parallel, real-time, and secure processing of large volumes of data grows every day. Here are a few examples of how large data analysis and real-time processing workloads have become:

  1. Every hour, Walmart processes more than one million customer transactions, inserting records into databases that are over 2.5 petabytes in size. That volume of information is equivalent to 167 times the amount of data contained in the books of the entire Library of Congress!
  2. Every day, the New York Stock Exchange processes 1 terabyte of trade information.
  3. Google is far ahead of the NYSE, processing 24 petabytes of data every day.
  4. Every 20 minutes, 2.7 million photographs are uploaded to Facebook.
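To get a feel for what these figures mean for a system that must keep up continuously, they can be converted into sustained per-second rates. The sketch below uses only the numbers quoted in the list above and assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and a perfectly even load over time, which real traffic will not have.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR  # 86,400

TB = 10**12  # assumption: decimal terabyte
PB = 10**15  # assumption: decimal petabyte


def per_second(amount: float, period_seconds: int) -> float:
    """Average rate if `amount` were spread evenly over the period."""
    return amount / period_seconds


walmart_tx_per_s = per_second(1_000_000, SECONDS_PER_HOUR)
nyse_bytes_per_s = per_second(1 * TB, SECONDS_PER_DAY)
google_bytes_per_s = per_second(24 * PB, SECONDS_PER_DAY)
facebook_photos_per_s = per_second(2_700_000, 20 * 60)

print(f"Walmart:  {walmart_tx_per_s:,.1f} transactions/s")
print(f"NYSE:     {nyse_bytes_per_s / 10**6:,.2f} MB/s")
print(f"Google:   {google_bytes_per_s / 10**9:,.1f} GB/s")
print(f"Facebook: {facebook_photos_per_s:,.0f} photos/s")
```

Under these assumptions the averages come to roughly 278 transactions per second for Walmart, about 11.6 MB/s for the NYSE, about 278 GB/s for Google, and 2,250 photos per second for Facebook. Peak rates would be higher than these averages.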

The standard hardware and software configurations used in the past cannot keep up with the growing demand for processing speed and volume. To answer the call for a quantum leap in processing capacity, stream computing is the new hero on the block! Its ability to process large volumes of data in real time and in parallel makes it well suited to the challenges described above.

IBM InfoSphere Streams

InfoSphere Streams is part of the IBM big data platform. It provides the ability to process and act on business data consistently and in real time.

The following are some of the capabilities of InfoSphere Streams:

  1. Continuous analysis of massive volumes of data at rates of up to petabytes per day.
  2. Ability to perform complex analytics on unstructured data from a variety of heterogeneous data sources, including text, images, audio, voice, VoIP, video, web traffic, email, GPS data, financial transaction data, satellite data, sensors, and any other type of digital information that might be relevant to the business.
  3. Enhanced ability to respond in real time to data and business exceptions by leveraging sub-millisecond latencies.
  4. Effective adaptation to rapid changes in metadata and data forms.
  5. Scalable and seamless deployment on a variety of hardware configurations, including symmetric multiprocessing (SMP), massively parallel processing (MPP), clustered configurations, and grid computing.

Features and Benefits of InfoSphere Streams

The following list gives a synopsis of the features and benefits of InfoSphere Streams:

  1. Provides user-centric tools suited to an agile development environment. There are easy-to-use interfaces for developers and administrators to build effective data-mining capabilities within applications and workflows. Developers can leverage integrated toolkits and sample applications specific to different industries.
  2. The product includes Streams Studio, an Eclipse-based integrated development environment (IDE) that supports InfoSphere Streams application development with editors, wizards, application structure graphs, and run-time monitoring for InfoSphere Streams applications.
  3. Support for reuse of existing Java or C++ code, as well as Predictive Model Markup Language (PMML) models.
  4. Support for IBM WebSphere® MQ Low Latency Messaging (LLM) transport technology and InfiniBand.