Meeting the challenge of big data
Building a scalable, high-performance data cluster to serve big data analytics
Big data analytics projects inevitably begin with big hopes and grand plans. Getting started with Hadoop and Spark is straightforward. Pilot projects start with open source tools, sample data and a modest goal. Pilot success might be a single view of previously independent data that allows end-to-end reporting of a customer or a process. And then the fun begins. Real data. Regular reports. Scaling the cluster.
The weakest link and the easiest issue to address is storage. The default open source data storage, Hadoop Distributed File System (HDFS), was not designed for the enterprise. For example, the data that organizations want to analyze almost always comes from other sources. It may contain client information that needs to be secured and access-controlled. Inevitably, other applications or users also want to access the same data as the big data cluster through industry-standard file or object interfaces.
The solution is to build a scalable, high-performance data cluster to serve big data analytics that also supports industry-standard protocols. With complete HDFS support and the scalable performance of the leading parallel file system, IBM Elastic Storage Server (ESS) 5.2 is the perfect building block for big data analytics storage. Built with IBM Spectrum Scale, the HDFS-transparent connector enables open source Hadoop and Spark frameworks to run without any modification. In fact, Hortonworks recently paper-certified IBM Spectrum Scale across its portfolio.
The real challenge for IBM Business Partners will be building the business case for ESS 5.2 and IBM Spectrum Scale with the three key stakeholders in a big data analytics project.
The data scientist on the core pilot team will resist any divergence from the open source choices because he or she fears the solution will not perform as the cluster scales. Proven on massive clusters, the IBM Spectrum Scale parallel file system removes the data bottlenecks common to other solutions. It can outperform HDFS on many benchmarks. However, it is the elimination of data ingest, transformation and extraction time that will greatly speed the time to insight and convince data scientists to take a serious look at IBM Spectrum Scale and ESS.
For the IT department that needs to support the environment, choosing ESS can lower both the CapEx and the OpEx of the solution. Because ESS uses advanced erasure coding to distribute and protect data, it requires only about 22 percent more physical storage than the data itself, or roughly 122 percent of the data size. In contrast, HDFS defaults to three-way replication, which consumes 300 percent of the size of the data being analyzed. In addition, ESS is architected to survive multiple failures and preserve data integrity. Redundant data paths and end-to-end checksums turn most issues into background repair tasks rather than emergencies. The ESS GUI provides a complete view of hardware and software, and integrates into IBM Spectrum Control for a portfolio view of storage and trends.
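As a rough illustration of that capacity comparison, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope sketch: the function name and the 100 TB dataset are hypothetical, the 22 percent figure is the approximate overhead quoted above, and three-way replication is modeled as two extra full copies of the data.

```python
def raw_capacity_tb(data_tb: float, overhead_fraction: float) -> float:
    """Physical capacity needed to store data_tb of data, given the
    protection overhead expressed as a fraction of the data size."""
    return data_tb * (1.0 + overhead_fraction)


# Approximate overhead quoted in this post for ESS erasure coding.
ERASURE_CODING_OVERHEAD = 0.22
# Three-way replication keeps two extra full copies: 200% overhead.
THREE_WAY_REPLICATION_OVERHEAD = 2.0

dataset_tb = 100.0  # hypothetical dataset size, for illustration only

ess_tb = raw_capacity_tb(dataset_tb, ERASURE_CODING_OVERHEAD)
hdfs_tb = raw_capacity_tb(dataset_tb, THREE_WAY_REPLICATION_OVERHEAD)

print(f"Erasure coded:     about {ess_tb:.0f} TB raw")
print(f"3-way replication: about {hdfs_tb:.0f} TB raw")
print(f"Difference:        about {hdfs_tb - ess_tb:.0f} TB raw")
```

Under these assumptions, a 100 TB dataset needs roughly 122 TB of physical capacity with erasure coding versus 300 TB with three-way replication. Actual figures depend on the erasure code and configuration chosen.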
The business-line executive sponsoring the project will probably be aware that open source data storage does not provide the security and governance that the data and the results require. Compliance with privacy and other regulations often requires the ability to audit a fraud, risk or compliance result against archived data. Archiving and auditing are straightforward for IBM Spectrum Scale systems, which are supported by IBM Spectrum Protect and most major backup solutions.
However, it may be the vision of the organization’s big data future that an executive will find most compelling. IBM Spectrum Scale and the Hadoop Connector can federate multiple data sources into a single HDFS view. Together they can span geographies for global collaboration. Plus, they can automatically tier to tape, on-premises object storage or the cloud to truly archive and analyze in place.
IBM Spectrum Scale, especially the ESS 5.2 all-flash solution, is a perfect reason to discuss the roadmap for big data analytics with your clients. They may still be in the pilot stage, but you will be ready for them when they move from sandbox to production. You can let me know what you think by using the comments feature below.
Manager, IBM Spectrum Solutions Marketing