PR3 Systems Blog
http://pr3systems.com/blog

Desktop Hardware Selection
http://pr3systems.com/blog/2012/05/21/desktop-hardware-selection/
Mon, 21 May 2012 16:49:24 +0000, akopec

Building a computer can be an intimidating task for someone who has never worked with hardware before. Hardware selection is the most important step when building a computer. This guide is for those interested in learning how to choose hardware for a custom build. It covers every required component, taking price, performance and complexity into consideration. Each component comes with a budget-friendly, current (May 2012) example.

Some of the advantages of building your own desktop as opposed to buying a prebuilt one are:

  • Price: Building a computer yourself is often cheaper than ordering a comparable prebuilt one.
  • Customization: You can choose each piece of hardware to match your exact requirements.
  • Knowledge: You learn how the hardware components interact with one another.

So let’s get started selecting our hardware.

Hardware Selection

At a minimum we have to select the following components:

  • Central Processing Unit (CPU)
  • Motherboard
  • Random Access Memory (RAM)
  • HDD or SSD (Or Both)
  • Power Supply Unit (PSU)
  • Computer Case

Additional Components:

  • CD/DVD Drive
  • Graphics Processing Unit (GPU)
  • Keyboard/Mouse/Monitor

CPU: The first decision you have to make is what kind of CPU you would like to purchase. Take into consideration which applications you plan on running on the machine. Are these applications CPU intensive? Do they rely more on multi-threading or multi-processing? Your budget can have a huge influence on your CPU selection as AMD processors tend to cost less than their Intel counterparts. Be sure to check customer reviews for any glaring problems with your selection. The CPU Socket Type will influence your motherboard selection as the socket types have to match.

Example CPU: AMD FX-6100:

http://www.newegg.com/Product/Product.aspx?Item=N82E16819103962 $139

Motherboard: After deciding on your CPU, the next step is to select a motherboard. When searching for a motherboard, filter your search by the socket type of your CPU. If the motherboard you select does not include on-board graphics, you will need to purchase a separate graphics processing unit (GPU).

Example Motherboard: ASUS M5A88-V EVO:

http://www.newegg.com/Product/Product.aspx?Item=N82E16813131733 $115

RAM: Desktop RAM is one of the cheapest components of your computer and can provide a big performance boost for a minimal increase in cost. A motherboard will typically accept up to four sticks of RAM. Make sure the operating frequency of your RAM matches one of the speeds the motherboard accepts. Also note that the operating system can limit the amount of usable memory; a 32-bit operating system, for example, can typically address only about 4 GB.

Example RAM: G.SKILL 4 GB (2 x 2 GB):

http://www.newegg.com/Product/Product.aspx?Item=N82E16820231394 $21

Hard Drive: Determine how much space you will need on your disk drive. If you will be storing files in a remote location, you do not need much local space. Keep in mind that applications tend to grow over time, and that higher-quality movies and pictures take up more room.

Example Drive: Western Digital 500 GB Hard Drive:

http://www.newegg.com/Product/Product.aspx?Item=N82E16822136769 $75

CD/DVD Drive: Lets you read and write DVDs. Pick one which costs the least while having good reviews.

Example DVD Drive: ASUS DVD Drive:

http://www.newegg.com/Product/Product.aspx?Item=N82E16827135204 $18

Power Supply Unit: To determine what kind of power supply unit you need, you can use one of the various online wattage calculators. These calculators estimate how much power (in watts) your components require. You should NEVER skimp on the PSU, as a poor-quality or undersized unit can damage all of your expensive hardware.

Wattage Calculator:

http://images10.newegg.com/BizIntell/tool/psucalc/index.html?name=Power-Supply-Wattage-Calculator
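A calculator of this kind essentially adds up the estimated power draw of each component and then leaves some headroom. The following is a minimal Python sketch of that idea. Every wattage figure and the 30% headroom are illustrative assumptions, not vendor specifications and not values taken from the calculator above.

```python
import math

# Rough PSU sizing: sum the estimated draw of each component, then add headroom.
# All wattage figures are illustrative assumptions, not vendor specifications.
ESTIMATED_DRAW_WATTS = {
    "cpu": 95,
    "motherboard": 50,
    "ram": 10,
    "hard_drive": 10,
    "dvd_drive": 20,
    "case_fans": 10,
}


def recommended_psu_watts(draw_by_component, headroom=0.30):
    """Return the total estimated draw plus a safety margin, rounded up to 10 W."""
    total = sum(draw_by_component.values())
    with_margin = total * (1 + headroom)
    return math.ceil(with_margin / 10) * 10


total_draw = sum(ESTIMATED_DRAW_WATTS.values())
print(f"Estimated draw: {total_draw} W")
print(f"Suggested minimum PSU: {recommended_psu_watts(ESTIMATED_DRAW_WATTS)} W")
```

Adding a dedicated graphics card would mean adding another entry to the dictionary, which is usually what pushes a build toward a larger power supply.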

Example PSU: Thermaltake 430W PSU:

http://www.newegg.com/Product/Product.aspx?Item=N82E16817153023 $46

Computer Case: Choose a computer case that is an appropriate size for your hardware components. Certain graphics cards (GPUs) take up a lot of space and need extra room to fit properly and allow good airflow.

Example Case: NZXT Mid Tower Case:

http://www.newegg.com/Product/Product.aspx?Item=N82E16811146075 $40
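To see what the example build costs as a whole, the listed prices can simply be added up. The short Python sketch below uses only the May 2012 example prices given in this guide; it leaves out anything the guide does not price, such as a GPU, keyboard, mouse, monitor or operating system.

```python
# Example parts and their May 2012 prices (USD), as listed in this guide.
EXAMPLE_BUILD = {
    "CPU: AMD FX-6100": 139,
    "Motherboard: ASUS M5A88-V EVO": 115,
    "RAM: G.SKILL 4 GB (2 x 2 GB)": 21,
    "Hard drive: Western Digital 500 GB": 75,
    "DVD drive: ASUS": 18,
    "PSU: Thermaltake 430W": 46,
    "Case: NZXT Mid Tower": 40,
}

total = sum(EXAMPLE_BUILD.values())

for part, price in EXAMPLE_BUILD.items():
    print(f"{part:<40} ${price:>4}")
print(f"{'Total':<40} ${total:>4}")  # Total comes to $454
```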

Good luck in your selection!

IBM Netezza Data Warehouse Appliance
http://pr3systems.com/blog/2012/05/01/ibm-netezza-data-warehouse-appliance/
Tue, 01 May 2012 18:36:03 +0000, akopec

The IBM Netezza data warehouse appliances are purpose-built for crunching massive volumes of data quickly and efficiently. This capability is delivered with IBM Netezza Analytics, which is fully integrated into the IBM Netezza asymmetric massively parallel processing (AMPP) architecture and enables data exploration, model building, model diagnostics and scoring at high speed.

IBM Netezza data warehouse appliances eliminate administrative tasks such as indexing, storage management, buffer pool tuning, memory allocation and schema optimization.

The combination of IBM’s patented AMPP architecture and Field Programmable Gate Arrays (FPGAs) delivers fast query performance and modular scalability on highly complex mixed workloads, and supports business intelligence and data warehouse users.

Asymmetric Massively Parallel Processing (AMPP) Architecture:

AMPP is designed for performance. Scalability goals are met by combining elements of both symmetric multiprocessing (SMP) and massively parallel processing (MPP), applying each method where it is best suited to the needs of BI applications operating on terabytes of data. It is a two-tiered architecture:

Tier – 1: SMP Host:

  • Compiles the queries received from the Users
  • Generates the query execution plan
  • Divides the query into sub-queries, or snippets, which can be executed in parallel, and distributes the snippets to the SPUs
  • Finally returns the results to the Users

Tier – 2: SPU (Snippet Processing Unit):

  • This tier contains many SPUs, which operate in parallel
  • Each SPU:
    1. Is an intelligent query processing node
    2. Is a storage node
    3. Consists of a powerful commodity processor, dedicated memory, a disk drive and a field-programmable disk controller with hard-wired logic to manage data flows and process queries at the disk level
  • The massively parallel, shared-nothing SPU blades provide the performance advantage of MPP
  • The SPUs respond to requests from the host, but they are highly autonomous, performing their own scheduling, storage management, transaction management, concurrency control and replication

The data traffic among SPUs, and between the SPUs and the SMP host, is greatly reduced by Intelligent Query Streaming technology. This technology filters records as they stream off the disk, delivering only the relevant information for each query instead of moving data into memory or across the network for processing. Intelligent Query Streaming is performed on each SPU by a Field-Programmable Gate Array (FPGA) chip that functions as the disk controller and is also capable of basic processing as data is read off the disk. The system is able to run critical database query functions such as parsing, filtering and projecting at full disk reading speed, while maintaining full ACID (Atomicity, Consistency, Isolation, and Durability) transactional operations of the database.
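The two-tier flow described above can be illustrated with a toy model. The Python sketch below is only an analogy under stated assumptions: threads stand in for SPUs, in-memory lists stand in for disk slices, and a plain function plays the role of the FPGA filter. The names `SPU_SLICES`, `run_snippet` and `host_query` are hypothetical and are not part of any Netezza API.

```python
from concurrent.futures import ThreadPoolExecutor

# Each "SPU" owns its own slice of the table (shared-nothing). Toy data only.
SPU_SLICES = [
    [{"id": 1, "region": "east", "amount": 120}, {"id": 2, "region": "west", "amount": 80}],
    [{"id": 3, "region": "east", "amount": 200}, {"id": 4, "region": "south", "amount": 50}],
    [{"id": 5, "region": "west", "amount": 300}, {"id": 6, "region": "east", "amount": 10}],
]


def run_snippet(rows, predicate, columns):
    """SPU tier: filter and project rows as they are read, so only the
    relevant columns of matching rows ever leave the node."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


def host_query(predicate, columns):
    """Host tier: send the same snippet to every SPU in parallel and
    merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(SPU_SLICES)) as pool:
        partials = pool.map(lambda rows: run_snippet(rows, predicate, columns), SPU_SLICES)
        return [row for partial in partials for row in partial]


result = host_query(lambda r: r["region"] == "east", ["id", "amount"])
print(result)
```

Of the six stored rows with three columns each, only three rows with two columns each reach the host, which is the reduction in data movement the paragraph above describes.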

IBM Netezza Analytics is designed to accelerate analytic queries and shorten query times, effectively providing better and faster answers to the most complex business questions.

This is used for:

  • Data exploration and discovery
  • Data transformation
  • Model building
  • Model diagnostics
  • Model scoring

The IBM Netezza data warehouse appliance family consists of the IBM Netezza 100 series, the IBM Netezza 1000 series and the IBM Netezza High Capacity Appliance series.

Netezza 100 series:

The Netezza 100 series delivers fast performance for entry-level data warehouses. It is well suited to small and mid-sized data warehouses and can also be used as a development and test system for high-performance BI applications.

This is an easy-to-use appliance that delivers high performance out of the box, with no indexing or tuning required. It is delivered ready-to-go for immediate data loading and query execution and integrates with all leading ETL, BI and analytic applications through standard ODBC, JDBC and OLE DB interfaces.

Netezza 100 is a very affordable analytic option, delivering up to 10 TB of user data capacity in a compact physical and environmental footprint.

Netezza 1000 series:

IBM Netezza 1000 is a purpose-built, standards-based data warehouse appliance that architecturally integrates database, server, storage and advanced analytic capabilities into a single, easy-to-manage system. The IBM Netezza 1000 appliance is designed for rapid and deep analysis of data volumes scaling into the petabytes.

This helps modelers to operate on the data directly inside the appliance instead of having to offload it to a separate infrastructure and deal with the associated data preprocessing, transformation and movement. Once the model is built, the prediction and scoring can be done right where the data resides, in line with other processing, on an as-needed basis. Users can get the results of prediction scores in near real-time, helping operationalize advanced analytics and making it available throughout the enterprise.

IBM Netezza 1000 adheres to IBM’s basic principle of moving processing close to the data.

Each IBM Netezza 1000 appliance contains multiple Snippet Blades or S-Blades, where SQL query code segments (or ‘snippets’) and complex analytic processes are executed. The S-Blades are intelligent processing nodes that make up the massively parallel processing engine of the appliance. Each S-Blade is an independent server that contains powerful multi-core Intel CPUs, IBM Netezza’s unique multi-engine FPGAs and gigabytes of RAM, all balanced and working concurrently to deliver peak performance.

IBM Netezza High Capacity Appliance:

The IBM Netezza High Capacity Appliance extends IBM Netezza’s family of data warehouse appliances to new extremes of data capacity, scaling to multiple petabytes of user data. This will enable organizations to meet a variety of analytical and historical data storage requirements with a single cost-effective appliance.

The IBM Netezza High Capacity Appliance series accelerates the industry’s leading massively-parallel data warehouse architecture to multi-petabyte scale, creating a “queryable archive” that can store, query and analyze thousands of terabytes of data quickly and cost-effectively.

The IBM Netezza High Capacity Appliance series arrives preconfigured and is typically ready to load data.

As databases scale to tens or hundreds of terabytes and petabytes, the increased data movement becomes unworkable, resulting in “data inertia”. The IBM Netezza High Capacity Appliance runs analytic computations directly in the appliance – without moving data – to ensure maximum analytics performance.

The IBM Netezza High Capacity Appliance reduces the cost and expands the available disaster recovery options for IBM Netezza users. Offering a wide range of capacities several times larger than those available in IBM Netezza 1000 (formerly known as TwinFin), one IBM Netezza High Capacity Appliance can serve as a consolidated hot-standby platform for one or more IBM Netezza 1000 appliances. This option is a good fit for users with multiple systems who need to redirect critical workloads to hot-standby systems during an outage.

The IBM Netezza High Capacity Appliance processes queries using IBM Netezza’s proven Asymmetric Massively Parallel Processing (AMPP) architecture. With AMPP, load, query and analytic work is split into many pieces and run in parallel to accelerate results. IBM Netezza High Capacity Appliances further shorten query times and raise throughput using software innovations such as ZoneMap acceleration, Clustered Base Tables and automatic data compression designed to streamline data movement and minimize I/O.
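The post names ZoneMap acceleration without describing it. The general idea behind a zone map is to keep the minimum and maximum value of a column for each block of storage, so that a scan can skip blocks that cannot contain matching rows. The Python sketch below illustrates that general technique only; it is an assumption-laden toy, not Netezza’s implementation, and the function names and the `order_day` column are hypothetical.

```python
def build_zone_map(extents, column):
    """Record the (min, max) of one column for every extent."""
    return [
        (min(row[column] for row in rows), max(row[column] for row in rows))
        for rows in extents
    ]


def scan_with_zone_map(extents, zone_map, column, low, high):
    """Return matching rows and the number of extents that had to be read."""
    matches, extents_read = [], 0
    for rows, (zone_min, zone_max) in zip(extents, zone_map):
        if zone_max < low or zone_min > high:
            continue  # no row in this extent can match, so skip the I/O
        extents_read += 1
        matches.extend(row for row in rows if low <= row[column] <= high)
    return matches, extents_read


# Three toy extents, loaded in order_day order.
extents = [
    [{"order_day": d} for d in (1, 2, 3)],
    [{"order_day": d} for d in (4, 5, 6)],
    [{"order_day": d} for d in (7, 8, 9)],
]
zone_map = build_zone_map(extents, "order_day")
rows, extents_read = scan_with_zone_map(extents, zone_map, "order_day", 5, 6)
print(rows, extents_read)
```

The benefit depends on how well the data is ordered on the filtered column: if every extent spanned days 1 through 9, no extent could be skipped.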

Performance Tuning in IBM InfoSphere DataStage
http://pr3systems.com/blog/2012/04/23/performance-tuning-in-ibm-infosphere-datastage/
Mon, 23 Apr 2012 21:15:33 +0000, akopec

Performance is a key factor in the success of any data warehousing project. Optimization and performance should be taken into account from the inception of the design and development process. Ideally, a DataStage® job should process large volumes of data within a short period of time. Maximum throughput also requires a well-performing infrastructure; without one, tuning DataStage® jobs will not make much of a difference.

One of the primary steps of performance tuning is to examine the end-to-end process flow within a DataStage® job and understand which steps in the job consume the most time and resources. This can be done in several ways:

1. The job score shows the generated processes, operator combinations, data sets, framework-inserted sorts, buffers and partitions in the job. The score can be generated by setting the APT_DUMP_SCORE environment variable to TRUE before running the job. It also provides information about the node-operator combinations. A score dump can help detect redundant operators, so that the job design can be modified to remove them.

2. The job monitor can be accessed through IBM InfoSphere DataStage® Director. It provides a snapshot of a job’s performance (data distribution/skew across partitions, CPU utilization) at runtime. APT_MONITOR_TIME and APT_MONITOR_SIZE are the two environment variables that control the operation of the job monitor, which takes a snapshot every five seconds by default. This interval can be changed by setting APT_MONITOR_TIME.

3. Performance Analysis, a new capability beginning in DataStage® 8.x, can be used to collect information, generate reports and view detailed charts about job timeline, record throughput, CPU utilization, job memory utilization and physical machine utilization (shows processes other than the DataStage® activity running on the machine). This is very useful in identifying the bottlenecks during a job’s execution. Performance Analysis can be enabled through a job property on the execution tab, which collects data at runtime. (Note: By default, this option is disabled)

4. Resource Estimation, a toolbar option available in DataStage® 8.x, can be used to determine the system requirements needed to execute a particular job based on varying source data volumes and/or to analyze whether the current infrastructure can support the jobs that have been created.
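One way to set the environment variables mentioned in items 1 and 2 for a single run is to pass them on the `dsjob` command line. The Python sketch below only builds and prints such a command; it does not run anything. It assumes the job has these environment variables defined as job parameters (which is why they carry a `$` prefix), and the project name `MyProject` and job name `LoadCustomers` are hypothetical.

```python
import shlex


def build_dsjob_command(project, job, env_params):
    """Build a dsjob command line that passes environment variables
    to the job as job parameters."""
    command = ["dsjob", "-run"]
    for name, value in env_params.items():
        command += ["-param", f"{name}={value}"]
    command += [project, job]
    return command


command = build_dsjob_command(
    "MyProject",      # hypothetical project name
    "LoadCustomers",  # hypothetical job name
    {
        "$APT_DUMP_SCORE": "TRUE",  # write the job score to the job log
        "$APT_MONITOR_TIME": "10",  # assumed value: snapshot every 10 seconds
    },
)

# Quote each part so the "$" is not expanded when the line is pasted into a shell.
print(" ".join(shlex.quote(part) for part in command))
```

The same variables can also be set at the project level, in which case no per-run parameters are needed.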

There are several key aspects that could affect the job performance and these should be taken into consideration during the job design:
• Parallel configuration files allow the degree of parallelism and resources used by parallel jobs to be set dynamically at runtime. Multiple configuration files should be used to optimize overall throughput and to match job characteristics to available hardware resources.
• A DataStage® job should not be overloaded with stages. Each additional stage in a job reduces the resources available for the other stages in that job, which affects the job performance.
• Columns that are not needed should not be propagated through the stages and jobs. Unused columns make each row transfer from one stage to the next more expensive. Removing these columns minimizes memory usage and optimizes buffering.
• Runtime column propagation (RCP) should be disabled in jobs to avoid unnecessary column propagation.
• By setting the $OSH_PRINT_SCHEMAS environment variable, we can verify that runtime schemas match the job column definitions. Avoid unnecessary data type conversions.
• Proper partitioning significantly improves overall job performance. Record counts per partition can be displayed by setting the environment variable, $APT_RECORD_COUNTS. Ideally, these counts should be approximately equal. Partitioning should be set in such a way so as to ensure an even data flow across all partitions, and data skew should be minimized. If business rules dictate otherwise, then repartitioning should be done as early as possible to have a more balanced distribution which will lead to improved performance of downstream stages.
• DataStage® attempts to combine stages (operators) into a single process, and operator combination is intended to improve overall performance and reduce resource usage. Avoid repartitioning and use ‘Same’ partitioning for operator combination to occur. However, in some circumstances, operator combination may negatively impact performance and in such cases, all the combinations should be disabled by setting $APT_DISABLE_COMBINATION=TRUE.
• Do not sort the data unless necessary. Sorts done on a database (using ORDER BY clause) are usually much faster than those done in DataStage®. Hence, sort the data when reading from the database if possible instead of using the Sort Stage or sorting on the input link.
• Sort order and partitioning are preserved in parallel datasets. If data has already been partitioned and sorted on a set of key columns, check the ″Don’t sort, previously sorted″ option for the key columns in the Sort Stage. When reading from these data sets, partitioning and sorting can be maintained by using the ‘Same’ partitioning method.
• Datasets store data in native internal format (no conversion overhead) and preserve partitioning and sort order. Unlike sequential files, they are parallelized and are therefore much faster. Datasets should therefore be used to land intermediate results in a set of linked jobs.
• Use Join Stage as opposed to Lookup for handling huge volumes of data. Lookup is most appropriate when the reference data is small enough to fit into available physical memory. Sparse lookup is appropriate if the driver to the Lookup is significantly smaller than the reference input (1:100).
• Avoid using multiple Transformer Stages when the functionality could be incorporated into a single stage. Use Copy, Filter, or Modify stages instead of Transformer for simple transformation functions like renaming or dropping columns, type conversions, filtering out based on certain constraints, mapping a single input link to multiple output links, etc.
• As much as possible, minimize the number of stage variables in a Transformer Stage as that affects performance, and also avoid unnecessary function calls.
• If existing Transformer-based jobs do not meet performance requirements and a complex reusable logic needs to be incorporated in the job, consider building your own custom stage.
• Data should not be read from Sequential files using ‘Same’ partitioning.
• Sequential files can be read in parallel by using the ‘Multiple readers per node’ option. Setting $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE environment variables may also improve performance of Sequential files on heavily loaded file servers.
• Use the Hash method in Aggregators only when the number of distinct key column values is small. A Sort method Aggregator should be used when the number of distinct key values is large or unknown.
• SQL statements in Database stages can be tuned for performance. Appropriate indexes on tables can improve the performance of DataStage® queries.
• ‘Array Size’ and ‘Record Count’ numerical values in Database stages can be tuned for faster inserts and updates. Default values are usually very low and may not be optimal.
• For database access, Connector stages are the best choice for maximum parallel performance and functionality.
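The partitioning guideline above (record counts per partition should be approximately equal) can be illustrated outside DataStage®. The Python sketch below hash-partitions two sets of keys and compares the resulting counts. It is an illustration under stated assumptions: `zlib.crc32` stands in for the real hash partitioner, and the function names and sample keys are hypothetical.

```python
import zlib


def partition_counts(keys, partitions):
    """Hash-partition the keys and return the record count per partition."""
    counts = [0] * partitions
    for key in keys:
        counts[zlib.crc32(str(key).encode()) % partitions] += 1
    return counts


def skew_ratio(counts):
    """Largest partition divided by the average partition size.
    A value of 1.0 means a perfectly even distribution."""
    return max(counts) / (sum(counts) / len(counts))


# A high-cardinality key (unique customer IDs) versus a low-cardinality one
# (country code), both spread over four partitions.
even_counts = partition_counts(range(10_000), 4)
skewed_counts = partition_counts(["US"] * 9_000 + ["CA"] * 1_000, 4)

print("customer_id:", even_counts, round(skew_ratio(even_counts), 2))
print("country:    ", skewed_counts, round(skew_ratio(skewed_counts), 2))
```

With the country key, at least 9,000 of the 10,000 records land in one partition, so one node does most of the work while the others sit idle. This is the kind of imbalance that the $APT_RECORD_COUNTS output makes visible.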

Conclusion: Performance issues can be avoided by following the above best practices and performance guidelines. ‘Performance Analysis’ and ‘Resource Estimation’ functionalities can be used to gather detailed performance related data, and assist with more complicated scenarios.

Special Thanks to These References:

  1. Information Server Documentation – Parallel Job Advanced Developer Guide
  2. http://www.element61.be/e/resourc-detail.asp?ResourceId=188
  3. http://DataStagedeveloper.blogspot.com/2008/01/DataStage-performance-tuning.html
Enterprise Cloud Computing and Software as a Service (SaaS)
http://pr3systems.com/blog/2012/03/30/enterprise-cloud-computing-and-software-as-a-service-saas/
Fri, 30 Mar 2012 22:35:17 +0000, akopec

What is cloud computing?

Cloud computing is a general term for anything that involves delivering hosted services over the Internet. Cloud computing is a technology that uses the internet and central remote servers to maintain data and applications. Cloud computing allows consumers and businesses to use applications without installation and access their personal files at any computer with internet access. This technology allows for much more efficient computing by centralizing storage, memory, processing and bandwidth.

What IT Needs:

Cloud computing comes into focus only when you think about what IT always needs: a way to increase capacity or add capabilities on the fly without investing in new infrastructure, training new personnel, or licensing new software. Cloud computing encompasses any subscription-based or pay-per-use service that, in real time over the Internet, extends IT’s existing capabilities.

Cloud services are broadly divided into three categories:

Infrastructure-as-a-Service (IaaS)
Platform-as-a-Service (PaaS)
Software-as-a-Service (SaaS)

Almost everyone reading this article is already using the cloud. For example, if you use Gmail, you are using an email server hosted by Google. iCloud is a newer service for Apple users where you can store your information, and so on.

What is enterprise cloud?

Example: Imagine you want a data warehouse platform (for example, DataStage). You would need all of the following:

a. Infrastructure
b. Human resources with expertise in the environment, infrastructure, application, etc.
c. Time and money
d. An estimate of the capacity and power of the tools
e. A forecast of future needs
f. Resource utilization
g. Etc.

Instead of all of the above, imagine a readily available system where you can purchase the services and start development work right away, without even administering the servers. This is what the enterprise cloud allows you to do.

Who needs cloud computing:

It is estimated that, in the near future, all small to medium companies with revenues of $5 billion to $20 billion will choose cloud computing as an option.

Advantages to the clients:

1. Services are available on demand. Clients don’t need to wait for infrastructure quotations, installations, etc.
2. Services are elastic. Clients can increase or decrease their capacity. No contracts.
3. Infrastructure management. All the servers and infrastructure are managed by the provider.
4. Utilization factor. Most businesses underutilize their own infrastructure. In this model, resources are utilized more effectively.
5. Economical. All the above factors make the cloud economical for clients.

Role of PR3 Systems:

Types of Cloud: Private or Public

Public Cloud:

A public cloud sells services to anyone on the Internet.

If the client chooses a public cloud, PR3 will offer development and production support services in the area of data warehousing (DataStage, QualityStage, Information Analyzer, Informatica, AbInitio). The PR3 team has industry experts in data warehousing (DataStage, QualityStage, Information Analyzer, Metadata Workbench, AbInitio, Unix, Oracle, Netezza, DB2 UDB, Teradata, DB2 for mainframe, SQL Server, etc.). PR3 also offers exclusive training in DataStage, QualityStage, Information Analyzer, etc.

You buy the cloud services from Amazon, Salesforce.com or IBM, and we manage the development, deployment and production support work for you.

Private Cloud:

A private cloud is a cloud that is offered only to specific users.
PR3 will offer (in the near future) a private cloud for small businesses, with the advantages of infrastructure maintenance, software development and maintenance services in the area of data warehousing.

Cloud Computing in a Data Warehouse
http://pr3systems.com/blog/2012/03/21/cloud-computing-in-a-data-warehouse/
Wed, 21 Mar 2012 14:12:54 +0000, akopec

Cloud computing is bringing a new era to the field of data warehousing and business intelligence. Organizations can now analyze large volumes of data faster and more cheaply. The highlights of cloud computing are: software that is available within minutes to a few hours, no need to host data centers, and a pay-per-use pricing model. The advantages of cloud computing are described below.

Benefits of cloud computing for Data Warehousing/Business Intelligence:

  1. Cost efficient: Data warehousing requires a lot of different software, and the infrastructure and hardware to host it are expensive. Customers can save by using the cloud and paying for resources according to their use, with no need to worry about ownership and maintenance of the software and hardware. This is especially beneficial for small and medium-sized companies. Cloud computing lets organizations reduce capital expenses; it requires only an operational expense, which is typically much lower than the up-front capital expense.
  2. Time efficient: Using the cloud can save a lot of time, as customers need not spend weeks buying, installing and configuring hardware and software. Cloud resources can be available for use almost immediately, which saves a lot of project time.
  3. Acquisitions and mergers: During acquisitions and mergers there are often many duplicate data marts and applications that need to be integrated or decommissioned. Cloud computing helps organizations focus on integrating the business rather than on the software and hardware infrastructure, so they can concentrate on their core competencies.
  4. Flexibility: Cloud computing provides customers with more options. With low operating costs and easy availability of software and hardware infrastructure, organizations can migrate from one technology to another comparatively easily.
  5. More competitiveness: The cloud creates more competition among providers of similar services. Users can evaluate similar software and hardware from different vendors easily and quickly. This healthy competition gives customers more options.
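The capital-versus-operational expense trade-off in point 1 can be made concrete with a small calculation. The Python sketch below compares the cumulative cost of owning infrastructure with a pay-per-use fee. All dollar figures are hypothetical and chosen only to illustrate the comparison; real pricing varies widely and usage-based fees are rarely flat.

```python
def cumulative_costs(months, upfront, monthly_upkeep, monthly_cloud_fee):
    """Return (on_premises, cloud) cumulative cost after the given number of months."""
    on_premises = upfront + monthly_upkeep * months
    cloud = monthly_cloud_fee * months
    return on_premises, cloud


def break_even_month(upfront, monthly_upkeep, monthly_cloud_fee, horizon=120):
    """First month in which owning becomes cheaper than renting,
    or None if that never happens within the horizon."""
    for month in range(1, horizon + 1):
        on_premises, cloud = cumulative_costs(month, upfront, monthly_upkeep, monthly_cloud_fee)
        if on_premises < cloud:
            return month
    return None


# Hypothetical figures in USD.
first_year = cumulative_costs(12, upfront=100_000, monthly_upkeep=2_000, monthly_cloud_fee=5_000)
crossover = break_even_month(upfront=100_000, monthly_upkeep=2_000, monthly_cloud_fee=5_000)

print("After 12 months (on-premises, cloud):", first_year)
print("Owning becomes cheaper in month:", crossover)
```

With these made-up numbers the cloud option costs less than half as much in the first year, and owning only catches up in the third year. A project that is short-lived or has uncertain demand therefore favors the pay-per-use model.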

Challenges of cloud computing in Data Warehousing/Business Intelligence:

  1. Data security: Data security is a challenge because the data is stored in the cloud, outside the owner’s infrastructure. For the cloud to become fully successful this issue needs to be addressed, as many companies (financial companies, for example) will not use the cloud for sensitive data.
  2. Ability to process bulk data: Since data warehousing and BI require bulk data handling, the performance of bulk data processing in the cloud needs to improve, as it currently may not be as fast as traditional data warehousing applications.

Some examples of current cloud computing services in Data Warehousing/BI:

    Software as a Service (SaaS): LogiXML, Birst, Lugicera.

    Platform as a Service (PaaS): Microsoft Azure, Google App Engine.

    Infrastructure as a Service (IaaS): Amazon EC2

Conclusion: I believe cloud computing will evolve further to accommodate mission-critical data warehousing and BI applications. It may revolutionize the area of data warehousing and business intelligence, and its lower operational cost will help small and medium-sized businesses make more use of analytical data.

If you have any questions or need more information, please contact PR3 Systems: info@pr3systems.com.

InfoSphere Information Server 8.7
http://pr3systems.com/blog/2012/03/05/infosphere-information-server-8-7/
Mon, 05 Mar 2012 15:11:35 +0000, akopec

InfoSphere Information Server v8.7 is IBM’s newest release of the platform that includes the ETL tool DataStage. It was released in October 2011 and offers some of the best and newest features available to a DataStage developer.

Why Upgrade?

There are many reasons to stay current with the newer releases of DataStage. The most significant reason for upgrading to Information Server 8.7 is the new features. New options can help developers improve their job design and overall development. New releases of DataStage can improve the performance of existing designs without changes, and the extended functionality allows new jobs to increase performance further. Each new release also fixes known issues from previous releases. Information Server 8.7 is built on the fix pack for version 8.5, so it is stable, and many of the known bugs in 8.5 are fixed in 8.7. Moving forward, it appears that newer releases will continue to improve the performance of existing features and increase the functionality of the full suite of products.

What’s New?

DataStage offers an advantage over competitors through the various tools within its suite of products. Version 8.7 takes Business Glossary to a new level with extended functionality and an improved user interface. The user interface now uses a single URL for all web-based activities, which lets users create and change glossary content in the same environment in which it will be viewed.

Version 8.7 offers a new rule stage which makes it easier to develop, execute and monitor any information that is located in a data store. This stage allows rules which have been created in InfoSphere Information Analyzer to be accessed by the DataStage and QualityStage designer, letting developers receive instant feedback on the correctness of analyzed rows.

Information Server 8.7 provides a new IBM InfoSphere DataStage and QualityStage Operations Console. The Operations Console is a view-only client that gives the developer access to the run-time environment of Information Server. It has graphs providing summaries of resource consumption, running processes, etc., which makes it much simpler to spot where jobs can be tuned to run more efficiently.

For more information, or to learn how Information Server 8.7 can benefit your organization, please contact info@pr3systems.com.

Why Master Data Management is Essential for an Organization and What it Requires.
http://pr3systems.com/blog/2012/02/20/why-master-data-management-is-essential-for-an-organization-and-what-it-requires/
Mon, 20 Feb 2012 16:22:13 +0000, akopec

Master Data is key information that is critical to the operation of a business. In other words, it encompasses the key business entities like customer, product, employee, vendor etc.

Master data should not be confused with transactional data or with the data in a data warehouse. Online transaction processing (OLTP) systems handle the daily operations and transactions of an organization. These are action details, like verbs in grammar. Master data holds the business entities of an organization, such as customer, product, employee and location. These are like nouns in grammar. For example, in "Employee works in location", Employee and Location make up master data, whereas the work details are stored in an OLTP database. Master data is used by multiple transaction-based applications in an organization and is kept in normalized form. A data warehouse stores data in a de-normalized way, with a star or snowflake schema. This data is used for non-operational purposes, to support business decisions. The dimension tables in the warehouse have to be updated daily based on updates from the master data.

Master data is used in numerous applications within an organization. Each application has its own database, so the same master data is stored in multiple databases. Each application uses this master data in a different way, and the developers building or maintaining an application are concerned with managing their own data, which is a subset of the domain, rather than with maintaining a single version of the truth. This leads to data being duplicated and inconsistent across different applications within the same organization. There are three common issues with master data:

  1. Data Synchronization: Since the master data is used by multiple applications, incorrect data can lead to undesired consequences, and the accuracy of this data can prove critical to the success of an organization. Say, for example, a customer moves from one state to another and notifies the customer service department, but customer service does not pass the update on to the marketing and accounts departments. The marketing department would still send advertisements and promotions to the old address, wasting advertising dollars. The accounts department would continue sending the bill to the old address and continue charging sales tax for the old state instead of the new one, leading to legal compliance issues. Similar issues can arise when, say, a customer gets married and changes her last name. It could lead to duplicate entries for the same customer with different names across different applications.
  2. Mergers: We see a lot of mergers and acquisitions in the corporate world. Each company has its own master data for customer, product, etc. in bits and pieces across its own set of applications. The same data could be present in both companies with different database keys. Say, for example, the same customer is identified by his SSN in company 1, but company 2 does not have his SSN. His address could also differ between the two companies, as one of them might not have updated its records.
  3. Building new applications: Companies are constantly adding new services and coming up with new innovative ways to serve their customers and capture a bigger share of the market. When this happens, they will have to identify the data that will be needed for the new system. But the data is spread all over the company in bits and pieces. So, it can be a nightmare trying to identify all the systems that have a subset of the data needed.
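
The synchronization problem described in the list above can be made concrete with a small sketch. It assumes each application's customer data is available as a plain dictionary keyed by a shared customer ID; the function name `find_mismatches` and the sample department names are hypothetical and chosen only for illustration.

```python
def find_mismatches(systems, field):
    """Return {customer_id: {system_name: value}} for customers whose
    value of `field` differs between the systems that know about them."""
    values = {}
    for system_name, records in systems.items():
        for customer_id, record in records.items():
            values.setdefault(customer_id, {})[system_name] = record.get(field)
    # Keep only customers where at least two systems disagree.
    return {
        customer_id: by_system
        for customer_id, by_system in values.items()
        if len(set(by_system.values())) > 1
    }


# Hypothetical sample data: customer C001 moved, but only customer
# service recorded the new state.
systems = {
    "customer_service": {"C001": {"state": "WI"}, "C002": {"state": "OH"}},
    "marketing": {"C001": {"state": "IL"}},
    "accounts": {"C001": {"state": "IL"}, "C002": {"state": "OH"}},
}
mismatches = find_mismatches(systems, "state")
print(mismatches)
```

A report like this only detects the inconsistency; deciding which value is correct still requires a rule or a person.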

So, there is a need to collect, correct, merge and manage this high-value data and ensure its quality and integrity so that there is just a single version of the truth. This ensures that all applications in an organization have access to the most accurate data and are informed when it changes. There are many advantages to creating a centralized master data management solution. A company could send a customer a single consolidated monthly bill for all the services provided. It enables the company to better understand each customer’s needs based on their purchases and to run an effective targeted marketing strategy with promotions and product information. It prevents sending the same information multiple times to the same customer. Since there is a single version of the truth, it can prevent fraudulent reporting. It also provides senior management with the most accurate data on vendors and accounts, helping them make better business decisions, such as what extra incentives can be offered to a vendor based on revenue volumes.

A master data management solution accumulates data from the source applications, consolidates it and makes it available to other applications within an organization. It manages the common data and the applications that access it. Whenever the data in an application changes, it spawns processes to accumulate the pieces of data and build a single unified version of the truth in a master data model. It provides a single version of the truth for each subject area of a business, such as customer, product or employee. The key essentials for any master data management solution are:

  1. Identify all the data sources for the information you need and convert to a common format.
  2. Analyze the quality of the data and correct if required.
  3. Assemble this master list of data in a centralized database and share it. Provide access to new applications that would need this data.
  4. If there are existing applications that cannot be migrated immediately to the centralized database, then a trigger based mechanism should be set up to consolidate and update the centralized database whenever the data in an existing system changes.
  5. Whenever the master data in the centralized database changes, all the surrounding stakeholder applications would have to be notified.
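
As a rough illustration of steps 3 to 5 above, the sketch below keeps one master record per key, merges updates arriving from source applications, and notifies subscribed applications when the master record actually changes. Everything in it is an assumption made for illustration: the class name `MasterDataHub`, its methods, the "last non-empty value wins" merge rule, and the string-only attribute values are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MasterDataHub:
    """Hypothetical in-memory stand-in for a centralized master database."""
    records: Dict[str, dict] = field(default_factory=dict)
    subscribers: List[Callable[[str, dict], None]] = field(default_factory=list)

    def subscribe(self, callback: Callable[[str, dict], None]) -> None:
        # Step 5: stakeholder applications register to be notified.
        self.subscribers.append(callback)

    def upsert(self, key: str, attributes: Dict[str, str]) -> dict:
        # Steps 3-4: consolidate an update from a source application.
        current = self.records.get(key, {})
        # Assumed merge rule: ignore blank values, latest non-blank wins.
        cleaned = {k: v.strip() for k, v in attributes.items() if v and v.strip()}
        merged = {**current, **cleaned}
        if merged != current:
            self.records[key] = merged
            for notify in self.subscribers:
                notify(key, dict(merged))
        return merged


hub = MasterDataHub()
seen = []
hub.subscribe(lambda key, record: seen.append((key, record["state"])))

hub.upsert("C001", {"name": "Pat Lee", "state": "IL"})  # new record
hub.upsert("C001", {"name": "Pat Lee", "state": ""})    # blank ignored, no change
hub.upsert("C001", {"state": "WI"})                     # customer moved
print(seen)
```

A real solution would persist the records and apply data quality rules before merging; the point here is only the consolidate-then-notify flow.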

For more information or to know how master data management can benefit your organization please contact info@pr3systems.com.

]]>
http://pr3systems.com/blog/2012/02/20/why-master-data-management-is-essential-for-an-organization-and-what-it-requires/feed/ 0
ETL Fundamentals http://pr3systems.com/blog/2009/09/24/etl-fundamentals/ http://pr3systems.com/blog/2009/09/24/etl-fundamentals/#comments Thu, 24 Sep 2009 21:41:51 +0000 Administrator http://pr3systems.com/blog/?p=12 As a company, chances are you have valuable data scattered throughout your system that needs to be gathered into a central location and accessed for business analysis. The problem is that currently the data exists in different systems, and in different formats.
This is where ETL comes into play. ETL is an acronym for Extract, Transform and Load, and is used to move data from one database to another, to form data marts and data warehouses, and also to convert databases from one format to another. ETL refers to the methods involved in accessing and manipulating source data and loading it into a target data warehouse.

Companies both large and small can use data warehouses to understand their data in a more analytical way and to extract the information contained within to increase their sales and revenues. While most companies already have databases to hold their transactional data (individual sales, for example) this data cannot easily be used for analysis. Whereas a database contains raw data, a data warehouse is a common repository which holds information which can be manipulated in such a way as to be used to answer complex business questions. “Which brand of hand soap did we sell the most of last month?” or “Which branch has the most customer traffic?” are the types of questions that the data warehouse can answer.

ETL is the process by which the data from everyday transactional databases gets moved or copied to the data warehouse. For example, a medical institution might have information on a patient in several departments, and each department might list that patient’s information differently. The admissions department might list the patient by name, whereas the billing department might list the patient by account number or other ID. ETL can bundle all this data and consolidate it into a uniform presentation, enabling it to be stored in a database or data warehouse.

As we said, ETL stands for Extract, Transform, and Load. These three functions are combined into one tool to pull data out of one database and place it into another. On a high level, Extract is the process of reading data from a database, Transform is the process of converting data from one form into another, and Load is the process of writing data into the target database. ETL is software that enables businesses to combine and move their data. Again, the data can come from any source and can be in different forms or formats. Thus a company may use ETL to move data from one application to another, or to back up information, especially if transitioning to a new software application.
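
The three functions can be sketched in a few lines of Python using the standard `sqlite3` module. This is a minimal sketch, not how an ETL tool works internally: the table names `sales` and `sales_clean`, the columns and the sample rows are all assumptions made for the example, and both databases live in memory.

```python
import sqlite3


def extract(conn):
    """Extract: read rows from the source database."""
    return conn.execute("SELECT id, name, amount FROM sales").fetchall()


def transform(rows):
    """Transform: tidy product names and round amounts to cents."""
    return [
        (row_id, name.strip().title(), round(amount, 2))
        for row_id, name, amount in rows
    ]


def load(conn, rows):
    """Load: write the transformed rows into the target database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_clean "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_clean VALUES (?, ?, ?)", rows)
    conn.commit()


# Hypothetical source system with inconsistently formatted data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (id INTEGER, name TEXT, amount REAL)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "  hand soap ", 3.499), (2, "SHAMPOO", 5.0)],
)

target = sqlite3.connect(":memory:")
load(target, transform(extract(source)))
result = target.execute(
    "SELECT id, name, amount FROM sales_clean ORDER BY id"
).fetchall()
print(result)
```

Keeping the three functions separate mirrors the description above: each stage can be tested, rerun or replaced on its own.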

The first step in the ETL process is to map the data between source systems and target databases (data warehouses or data marts). Raw data can be written directly to disk, usually with only minimal restructuring. Structured source system data can be written to a relational table or flat file in this first step as well. This enables the extract to be quick and simple, and also allows the extract to be restarted in case of an interruption. After extraction, the data is transformed (modified) depending on specific business logic, and sent to the target repository. The data can then also be read multiple times if needed to support subsequent steps.

The second step is the cleansing of the source data in the staging area. Cleansing is an important function of ETL, as it eliminates duplicate or fragmented data, and data which is not required in the final target. ETL can be customized to fit your company’s particular needs.
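
A cleansing pass of this kind might look like the following sketch, which drops fragmented rows (those missing a required field) and duplicates from a staging list. The field names `patient_id` and `name`, and the rule that two rows are duplicates when ID and case-insensitive name match, are assumptions for illustration; real cleansing rules depend on the data.

```python
def cleanse(rows, required=("patient_id", "name")):
    """Return rows with fragmented and duplicate entries removed."""
    seen = set()
    clean = []
    for row in rows:
        # Fragmented: a required field is missing or empty.
        if any(not row.get(field_name) for field_name in required):
            continue
        # Duplicate: same ID and same name, ignoring case and padding.
        key = (row["patient_id"], row["name"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        clean.append(row)
    return clean


# Hypothetical staging-area contents.
staged = [
    {"patient_id": "P1", "name": "Ann Smith"},
    {"patient_id": "P1", "name": " ann smith "},  # duplicate of the first row
    {"patient_id": "", "name": "No Id"},          # fragmented, no ID
    {"patient_id": "P2", "name": "Bob Jones"},
]
cleaned = cleanse(staged)
print(cleaned)
```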

The third step is transforming the cleansed source data and loading it into the target system. Transformation occurs via lookup tables, rules, and combining data. It is the process of converting the existing data into the format consistent with the data warehouse. The ETL software examines the data and, based on the rules it’s been given, updates it to the format required by the target repository. For example, a patient’s gender may be represented by “M/F” in one system, “0/1” in a second system, and “male/female” in yet another system. ETL is powerful enough to handle such dissimilarities by recognizing these different representations as the same information and converting them to the chosen format. Additionally, ETL can perform such functions as verifying phone numbers, standardizing fields, or expanding records with additional fields.
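
The gender example can be expressed as a lookup-table transformation. In this sketch the target format is assumed to be "M"/"F", the meaning of the numeric codes (0 for male, 1 for female) is an assumption because it varies between systems, and the fallback value "U" for unrecognized input is also hypothetical.

```python
# Lookup table mapping every known source representation to the
# chosen target format. The 0/1 meanings are assumed for this example.
GENDER_LOOKUP = {
    "m": "M", "male": "M", "0": "M",
    "f": "F", "female": "F", "1": "F",
}


def standardize_gender(value):
    """Convert a source gender code to the target format, or "U" if unknown."""
    return GENDER_LOOKUP.get(str(value).strip().lower(), "U")


for raw in ["M", "female", 0, " f ", "n/a"]:
    print(repr(raw), "->", standardize_gender(raw))
```

In practice the numeric codes would be mapped per source system, since one system's 0 may be another system's 1.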

Transformation is perhaps the most powerful of the ETL steps. It can not only transform data from different departments but also data from different sources altogether. For example, data in an email program could be transformed right along with data from a manufacturing application, with the ultimate result being data of a common thread.

Loading is the process of storing the newly transformed data. The data is transported and loaded into the data warehouse via a variety of methods. The data can be normalized (e.g. by Snowflake Schema) or denormalized (e.g. by Star Schema). ETL allows you the flexibility to determine the method and outcomes ideally suited to your business needs.
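
To show what loading into a star schema can look like, the sketch below uses `sqlite3` to fill one dimension table and one fact table, then asks the kind of question a data warehouse is meant to answer. The table names `dim_product` and `fact_sales`, the columns and the sample figures are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    brand TEXT UNIQUE
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    sale_month TEXT,
    units INTEGER
);
""")


def product_key(conn, brand):
    """Return the dimension key for a brand, creating the row if needed."""
    conn.execute("INSERT OR IGNORE INTO dim_product (brand) VALUES (?)", (brand,))
    row = conn.execute(
        "SELECT product_key FROM dim_product WHERE brand = ?", (brand,)
    ).fetchone()
    return row[0]


# Hypothetical transformed rows: (brand, month, units sold).
transformed = [
    ("Brand A", "2009-08", 120),
    ("Brand B", "2009-08", 95),
    ("Brand A", "2009-08", 40),
]
for brand, month, units in transformed:
    conn.execute(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        (product_key(conn, brand), month, units),
    )
conn.commit()

# Which brand sold the most units in the month?
top_brand = conn.execute("""
    SELECT d.brand, SUM(f.units) AS total
    FROM fact_sales f JOIN dim_product d USING (product_key)
    WHERE f.sale_month = '2009-08'
    GROUP BY d.brand
    ORDER BY total DESC
    LIMIT 1
""").fetchone()
print(top_brand)
```

The fact table holds only keys and measures, while descriptive attributes such as the brand name live in the dimension table.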

Initially, the ETL process was performed by programmers using SQL code, which had its share of negatives. It could take long hours, use many resources and require complex coding, not to mention the challenge of maintaining the code. It was an unwieldy and tiresome task. Today, thankfully, there are more than a few ETL tools available on the market, which have eliminated these difficulties. The tools are extremely powerful and offer countless advantages in all stages of the ETL process (extraction, data cleansing, data profiling, transformation, debugging and loading) when compared to the old method. They reduce costs as well as coding effort. The tools do the job quite well and, since they provide a graphical interface, require less expertise in database programming.

ETL tools process the data according to your business needs. They range from free open source tools to high-priced commercial tools. The amount of data you have, the answers you will request from your data warehouse, and how often you require those answers all need to be taken into account when choosing the right ETL tool for your business.

At PR3 Systems, we specialize in IBM’s DataStage ETL tool. As an IBM business partner, we provide the highest quality consulting and training services. Whether you are just beginning with DataStage, or upgrading to a later version, our IBM certified consultants will work with you to analyze your business needs and provide the development and/or training necessary to ensure you are getting the most out of your software. After all, especially in today’s economy, the bottom line is what counts. You’ve spent the time and resources to understand the positive impact DataStage can have on your business, so don’t stop there. Maximize your ROI by realizing the full potential DataStage offers. Contact us at 630-839-9258 or 630-364-1469 for more information.

PR3 Systems ~ Empowering you to make the right decisions at the right time.

]]>
http://pr3systems.com/blog/2009/09/24/etl-fundamentals/feed/ 0
DataStage Parallel Processing Architecture Overview Video http://pr3systems.com/blog/2009/08/28/datastage-parallel-processing-architecture-overview-video/ http://pr3systems.com/blog/2009/08/28/datastage-parallel-processing-architecture-overview-video/#comments Fri, 28 Aug 2009 15:43:17 +0000 Administrator http://pr3systems.com/blog/2009/08/28/datastage-parallel-processing-architecture-overview-video/ The following video presents an overview of DataStage Parallel Processing Architecture.

]]>
http://pr3systems.com/blog/2009/08/28/datastage-parallel-processing-architecture-overview-video/feed/ 0