Denormalize Transactional Data using Informatica

Let’s de-normalize transactional data.
Here is the input:

CUSTOMER_ID | PRODUCT_NAME     | PRODUCT_PRICE | TRANSACTION_DATE
1           | COOKING BOOKS    | 12.50         | 06/07/2013 10:57:10
1           | MEDICAL BOOKS    | 120.50        | 06/07/2013 10:57:10
1           | TRAVEL BOOKS     | 10.50         | 06/07/2013 10:57:10
2           | FOOD PROCESSOR   | 12.50         | 06/07/2013 10:57:10
2           | INTELLIGENT OVEN | 120.50        | 06/07/2013 10:57:10
2           | SAFE             | 10.50         | 06/07/2013 10:57:10
3           | CLOTHING         | 12555.25      | 06/07/2013 10:57:10
3           | CAR VACUUM       | 15520.45      | 06/07/2013 10:57:10
3           | HOTEL PASS       | 1550.55      | 06/07/2013 10:57:10
3           | MAGAZINE         | 1110.56       | 06/07/2013 10:57:10
4           | ZOO PASS         | 110.40        | 06/07/2013 10:57:10

Here is the output:
The output table is quite wide, so you can download it as a CSV.

Here is the mapping overview:
SQ ==> EXP ==> AGGR ==> TGT

[Image: mapping overview]

Step 1:
In SQ_ORDERS sort by CUSTOMER_ID

Step 2:
In EXP_SetSequenceNumbers create three variable ports, as follows:

  1. currCustomer_v ==> TO_CHAR(CUSTOMER_ID)
  2. Seq_v ==> IIF(currCustomer_v != prevCustomer_v , 1 , Seq_v + 1). Each time we get a new CUSTOMER_ID the sequence resets to 1; otherwise it increments by 1.
  3. prevCustomer_v ==> TO_CHAR(CUSTOMER_ID)

Make sure that prevCustomer_v sits below currCustomer_v and Seq_v, as in the following pic, because Informatica evaluates variable ports top to bottom:

[Image: EXP_SetSequenceNumbers ports]

Step 3:
In AGG_DenormalizeOrders make sure that the following is set:

  1. Sorted Input is checked.
  2. Group By is checked for CUSTOMER_ID.
  3. Create output ports to hold ProductName, ProductPrice and TransactionDate for each sequence number.
  4. For each output port, pick the value matching its sequence number. For instance, ProductName3 ==> FIRST( PRODUCT_NAME, seq = 3 ), etc.

For better understanding please see pic below:

[Image: AGG_DenormalizeOrders ports]
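The flow in Steps 1 to 3 can be sketched in plain Python (an illustrative sketch of the row-by-row logic only, not Informatica code; the column and port names follow the mapping above, and only a subset of the input rows is used):

```python
from collections import defaultdict

# Input rows, already sorted by CUSTOMER_ID (the SQ sort from Step 1).
rows = [
    (1, "COOKING BOOKS", 12.50, "06/07/2013 10:57:10"),
    (1, "MEDICAL BOOKS", 120.50, "06/07/2013 10:57:10"),
    (1, "TRAVEL BOOKS", 10.50, "06/07/2013 10:57:10"),
    (2, "FOOD PROCESSOR", 12.50, "06/07/2013 10:57:10"),
]

# EXP_SetSequenceNumbers: per-customer sequence number, reset on each new customer.
prev_customer = None
seq = 0
tagged = []
for customer_id, name, price, tdate in rows:
    seq = 1 if customer_id != prev_customer else seq + 1
    prev_customer = customer_id
    tagged.append((customer_id, seq, name, price, tdate))

# AGG_DenormalizeOrders: one output row per customer; ProductName<n> receives
# the row whose sequence number is n, i.e. FIRST(PRODUCT_NAME, seq = n).
denorm = defaultdict(dict)
for customer_id, seq, name, price, tdate in tagged:
    denorm[customer_id][f"ProductName{seq}"] = name
    denorm[customer_id][f"ProductPrice{seq}"] = price
    denorm[customer_id][f"TransactionDate{seq}"] = tdate

print(denorm[1]["ProductName3"])   # TRAVEL BOOKS
```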

 

Step 4:
Connect the aggregator to the target.
Done!

Install cx_Oracle on Ubuntu

If you are still having trouble running sqlplus please refer to my previous article on how to install Oracle Client.

In order to install cx_Oracle open a Terminal and execute the following:


sudo -i

apt-get install python-pip python-dev build-essential

export ORACLE_HOME=/usr/lib/oracle/11.2/client

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME

cd $ORACLE_HOME

ln -s libclntsh.so.x.x libclntsh.so   # replace x.x with the installed version number

pip install cx_Oracle

Done!

 

Install Oracle client on Ubuntu Desktop 12 (32 bits)

As you know, Oracle does not support Ubuntu, and I was not able to find a step-by-step guide for installing the Oracle client on it. Therefore, I am going to provide the steps here.

Step 1:
Open a Terminal and run the following:

sudo apt-get install alien

 

Step 2:

Go to http://www.oracle.com/technetwork/topics/linuxsoft-082809.html and download the following:
oracle-instantclient11.2-basic-11.2.0.3.0-1.i386.rpm
oracle-instantclient11.2-sqlplus-11.2.0.3.0-1.i386.rpm
oracle-instantclient11.2-devel-11.2.0.3.0-1.i386.rpm

Step 3:

Go back to the terminal and move to the directory where you downloaded the files and execute the following:


sudo alien -i oracle-instantclient11.2-basic-11.2.0.3.0-1.i386.rpm
sudo alien -i oracle-instantclient11.2-sqlplus-11.2.0.3.0-1.i386.rpm
sudo alien -i oracle-instantclient11.2-devel-11.2.0.3.0-1.i386.rpm
sudo apt-get install libaio1
sudo vi /etc/ld.so.conf.d/oracle.conf   # add the line: /usr/lib/oracle/11.2/client/lib/
sudo ldconfig

Step 4

At this point sqlplus should be running from the terminal with no issues.

 

Add the following to .bashrc:

export ORACLE_HOME=/usr/lib/oracle/11.2/client
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME

 

Step 5:

Execute the following:


source .bashrc

cd $ORACLE_HOME

sudo mkdir network

cd network/

sudo mkdir admin

cd admin

Step 6:

Put your tnsnames.ora file in /usr/lib/oracle/11.2/client/network/admin/

Done!

 

How to denormalize data in Informatica

Most of you know that Informatica does not provide a built-in function for denormalization. However, denormalizing a table is easy and can be done in Informatica using two transformations: an Expression and an Aggregator.

Step 1:
Let us look at the data:

CUSTOMER_BRAND

CUSTOMER_ID | BRAND
1           | BMW
1           | CHEVY
1           | FORD
2           | BMW
2           | FORD
3           | CHEVY

 

Step 2:
Let us look at the desired output:

CUSTOMER_ID | BMW | FORD | CHEVY
1           | Y   | Y    | Y
3           | N   | N    | Y
2           | Y   | Y    | N

 

Step 3:
Let us look at the general architecture of the mapping:

The mapping consists of the following: a source table, an SQ, an Expression, an Aggregator, an Expression, and a Target.

 

Step 4:
Let us look at the SQ:

Make sure that we sort on CUSTOMER_ID ascending. This is needed so we can send sorted input to the Aggregator.

 

Step 5:
Let us look at the Expression transformation:

The Expression sets each brand flag to “Y” if the row’s brand matches it, otherwise “N”.


Step 6:

Let us look at the Aggregator transformation:

It takes the MAX of each brand flag per customer. Since “Y” > “N”, it outputs “Y” if the customer has the brand and “N” otherwise.
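The Expression-plus-Aggregator trick can be sketched in plain Python (illustrative only, not Informatica code; the max over "Y"/"N" strings stands in for the Aggregator's MAX, which works precisely because "Y" > "N"):

```python
# Input rows from CUSTOMER_BRAND, sorted by CUSTOMER_ID.
rows = [(1, "BMW"), (1, "CHEVY"), (1, "FORD"),
        (2, "BMW"), (2, "FORD"), (3, "CHEVY")]
brands = ["BMW", "FORD", "CHEVY"]

# Expression: flag each brand "Y"/"N" per row; Aggregator: MAX per customer.
out = {}
for customer_id, brand in rows:
    flags = out.setdefault(customer_id, {b: "N" for b in brands})
    for b in brands:
        row_flag = "Y" if brand == b else "N"
        flags[b] = max(flags[b], row_flag)   # "Y" > "N" lexically

print(out[2])   # {'BMW': 'Y', 'FORD': 'Y', 'CHEVY': 'N'}
```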


Step 7:
The final step is easy: just connect the Aggregator to the Target.

If you are interested in a more complicated example please check: Denormalize Transactional Data using Informatica

Perl Pie (aka perl -p -i -e)

A few days ago a co-worker came to me for advice and said: “I have a huge file that has 2 million records and I would like to change a date in it from 04/29/2012 to 04/15/2012.” He emphasized that only one column contains that date.

In order to solve the problem I went ahead and executed this statement on unix command line:

$ perl -p -i -e 's|04/29/2012|04/15/2012|g' file

The run time was a few seconds and all records were updated properly.
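For readers without Perl handy, the same in-place substitution can be sketched in Python (illustrative only; note that perl -p streams line by line, while this naive version reads the whole file into memory):

```python
import pathlib

path = pathlib.Path("file")
path.write_text("id,date\n1,04/29/2012\n2,04/29/2012\n")  # sample data for the demo

# Equivalent of: perl -p -i -e 's|04/29/2012|04/15/2012|g' file
path.write_text(path.read_text().replace("04/29/2012", "04/15/2012"))

print(path.read_text())
```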

In Oracle “Not IN” is not your best friend with big data

Let us assume that you have a Brand table that you’d like to query and get information such as count, de-normalization etc…

Method number 1, which is slow, consists of the following query:

SELECT
CUST_ID,
COUNT(DISTINCT BRAND) AS BRAND_COUNT,
MAX(CASE WHEN BRAND = 'MERCEDES' THEN 'Y' ELSE 'N' END) AS HAS_MERCEDES,
MAX(CASE WHEN BRAND = 'BMW' THEN 'Y' ELSE 'N' END) AS HAS_BMW
FROM
CUST_BRAND
WHERE
BRAND NOT IN ('VOLKSWAGEN' , 'OPEL' , 'FIAT')
GROUP BY CUST_ID;

Method number 2, which is faster, consists of the following query:

SELECT
CUST_ID,
COUNT(DISTINCT CUST_BRAND.BRAND) AS BRAND_COUNT,
MAX(CASE WHEN CUST_BRAND.BRAND = 'MERCEDES' THEN 'Y' ELSE 'N' END) AS HAS_MERCEDES,
MAX(CASE WHEN CUST_BRAND.BRAND = 'BMW' THEN 'Y' ELSE 'N' END) AS HAS_BMW
FROM
CUST_BRAND ,
( SELECT 'VOLKSWAGEN' AS BRAND FROM DUAL UNION ALL
SELECT 'OPEL' AS BRAND FROM DUAL UNION ALL
SELECT 'FIAT' AS BRAND FROM DUAL ) EXCLUDE
WHERE
CUST_BRAND.BRAND = EXCLUDE.BRAND(+)
AND
EXCLUDE.BRAND IS NULL
GROUP BY CUST_ID;

Method #2 replaces “NOT IN” with an outer join to a small inline table. On a big table, this can save a substantial amount of time.

Side Note:
Query #2 is recommended if your column is nullable. You can still use “NOT IN” if you add a “BRAND IS NOT NULL” condition. I recommend writing both queries, looking at the explain plans, executing them, and then deciding which one is more efficient.

Oracle Pivot function

I was amazed at how handy the Pivot function can be. It lets the user produce an aggregated, denormalized view of a data set, which can be a query or a table. In this article I am going to provide a brief example of the pivot function.

First of all let us set some data: 


CREATE TABLE SAMPLE_TRANSACTIONS
(
TRANSACTION_ID NUMBER(8),
COMPANY_NAME VARCHAR2(30),
STATE VARCHAR2(2),
TRANSACTION_AMOUNT NUMBER(10,2)
);
INSERT INTO SAMPLE_TRANSACTIONS VALUES ( 1 , 'PALERMO' , 'TX' , 300.22 );
INSERT INTO SAMPLE_TRANSACTIONS VALUES ( 2 , 'PALERMO' , 'CT' , 600.22 );
INSERT INTO SAMPLE_TRANSACTIONS VALUES ( 3 , 'PALERMO' , 'PA' , 900.669 );
INSERT INTO SAMPLE_TRANSACTIONS VALUES ( 4 , 'BOLOGNA' , 'TX' , 500.22 );
INSERT INTO SAMPLE_TRANSACTIONS VALUES ( 5 , 'BOLOGNA' , 'CT' , 1100.22 );
INSERT INTO SAMPLE_TRANSACTIONS VALUES ( 6 , 'BOLOGNA' , 'PA' , 100.669 );
INSERT INTO SAMPLE_TRANSACTIONS VALUES ( 7 , 'PARMA' , 'NY' , 1300.22 );
INSERT INTO SAMPLE_TRANSACTIONS VALUES ( 8 , 'PARMA' , 'NJ' , 1600.22 );
INSERT INTO SAMPLE_TRANSACTIONS VALUES ( 9 , 'PARMA' , 'CA' , 1900.669 );

Now let us view the data:

SELECT * FROM SAMPLE_TRANSACTIONS;

 

TRANSACTION_ID | COMPANY_NAME | STATE | TRANSACTION_AMOUNT
1              | PALERMO      | TX    | 300.22
2              | PALERMO      | CT    | 600.22
3              | PALERMO      | PA    | 900.67
4              | BOLOGNA      | TX    | 500.22
5              | BOLOGNA      | CT    | 1100.22
6              | BOLOGNA      | PA    | 100.67
7              | PARMA        | NY    | 1300.22
8              | PARMA        | NJ    | 1600.22
9              | PARMA        | CA    | 1900.67

 

Now let us create a report that provides the sum of the transactions by company at a state level:
Let us first execute this query:


SELECT *
FROM
( SELECT COMPANY_NAME , STATE , TRANSACTION_AMOUNT
FROM SAMPLE_TRANSACTIONS
) PIVOT
( SUM(TRANSACTION_AMOUNT)
FOR
COMPANY_NAME IN
('BOLOGNA' AS BOLOGNA , 'PARMA' AS PARMA, 'PALERMO' AS PALERMO) )
ORDER BY STATE;

Results:

STATE | BOLOGNA | PARMA   | PALERMO
CA    | {null}  | 1900.67 | {null}
CT    | 1100.22 | {null}  | 600.22
NJ    | {null}  | 1600.22 | {null}
NY    | {null}  | 1300.22 | {null}
PA    | 100.67  | {null}  | 900.67
TX    | 500.22  | {null}  | 300.22
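The mechanics of that PIVOT can be sketched in plain Python (illustrative only, using a subset of the rows above): group by the remaining column (STATE) and make one output column per pivoted value (COMPANY_NAME), aggregated with SUM.

```python
from collections import defaultdict

# (COMPANY_NAME, STATE, TRANSACTION_AMOUNT) rows from SAMPLE_TRANSACTIONS.
rows = [("PALERMO", "TX", 300.22), ("PALERMO", "CT", 600.22),
        ("BOLOGNA", "TX", 500.22), ("BOLOGNA", "CT", 1100.22)]

# SUM(TRANSACTION_AMOUNT) FOR COMPANY_NAME IN (...): one cell per (state, company).
pivot = defaultdict(dict)
for company, state, amount in rows:
    pivot[state][company] = pivot[state].get(company, 0) + amount

print(pivot["CT"])   # {'PALERMO': 600.22, 'BOLOGNA': 1100.22}
```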

 

Now let us create a report that provides the sum of the transactions by state at a company level:

Let us first execute this query:


SELECT *
FROM
( SELECT COMPANY_NAME , STATE , TRANSACTION_AMOUNT FROM SAMPLE_TRANSACTIONS
) PIVOT
( SUM(TRANSACTION_AMOUNT)
FOR STATE IN ('CT' CT , 'NY' NY, 'TX' TX) )
ORDER BY COMPANY_NAME;

Results

COMPANY_NAME | CT      | NY      | TX
BOLOGNA      | 1100.22 | {null}  | 500.22
PALERMO      | 600.22  | {null}  | 300.22
PARMA        | {null}  | 1300.22 | {null}

 

As you can see above there are a lot of possibilities on how to leverage the pivot function. For further reading I recommend this link Oracle Pivot.

 

Update else Insert in Informatica

In order to successfully implement upsert using Informatica we need to follow these steps:

  1. In the mapping we need an update strategy with DD_UPDATE before the Target.
  2. In the session we need the following settings:
    1. Under Properties: set “Treat source rows as” to “Data driven”.
    2. Under Mapping, the Target should have both “Insert” and “Update else Insert” checked.
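The update-else-insert semantics can be sketched in plain Python (illustrative only; a dict keyed on the target's primary key stands in for the target table, and the sample names are made up for the demo):

```python
# Target rows keyed by primary key; a stand-in for the target table.
target = {1: {"name": "ALICE"}, 2: {"name": "BOB"}}

incoming = [(2, {"name": "BOBBY"}), (3, {"name": "CAROL"})]

# "Update else insert": try the update first; if the key is absent, insert.
for key, row in incoming:
    if key in target:
        target[key].update(row)   # update path (DD_UPDATE)
    else:
        target[key] = row         # insert path

print(target)
```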

I do not recommend “update else insert” for large data sets if you need an efficient, fast process. It is better to stick with two separate paths in the mapping: one for updates and one for inserts.

How to efficiently de-dupe a table in Oracle

A bad approach to de-duping a table is to delete the dupes. Unfortunately, when dealing with tables that have millions of rows, the delete method takes a long time and consumes a lot of server resources. Below is a step-by-step example that shows how we can de-dupe a table using Oracle’s ROW_NUMBER analytic function with a PARTITION BY clause.

Step 1:
Let us create a table and load data into it.

CREATE TABLE TEST_DEDUPE
(
CUSTOMER_ID NUMBER(10),
BRAND       VARCHAR2(20),
TRANSACTION_DATE DATE
);
INSERT INTO TEST_DEDUPE VALUES ( 1 , 'GREAT VALUE' , TO_DATE('01/20/2011' , 'MM/DD/YYYY') );
INSERT INTO TEST_DEDUPE VALUES ( 1 , 'GREAT VALUE' , TO_DATE('01/20/2011' , 'MM/DD/YYYY') );
INSERT INTO TEST_DEDUPE VALUES ( 1 , 'GREAT VALUE' , TO_DATE('01/20/2011' , 'MM/DD/YYYY') );
INSERT INTO TEST_DEDUPE VALUES ( 1 , 'GREAT VALUE' , TO_DATE('01/22/2011' , 'MM/DD/YYYY') );
INSERT INTO TEST_DEDUPE VALUES ( 1 , 'GREAT VALUE' , TO_DATE('01/22/2011' , 'MM/DD/YYYY') );
INSERT INTO TEST_DEDUPE VALUES ( 2 , 'MEMBERS MARK' , TO_DATE('01/22/2011' , 'MM/DD/YYYY') );
INSERT INTO TEST_DEDUPE VALUES ( 2 , 'MEMBERS MARK' , TO_DATE('01/22/2011' , 'MM/DD/YYYY') );
INSERT INTO TEST_DEDUPE VALUES ( 2 , 'MEMBERS MARK' , TO_DATE('01/22/2011' , 'MM/DD/YYYY') );
INSERT INTO TEST_DEDUPE VALUES ( 2 , 'MEMBERS MARK' , TO_DATE('01/15/2011' , 'MM/DD/YYYY') );
INSERT INTO TEST_DEDUPE VALUES ( 2 , 'MEMBERS MARK' , TO_DATE('01/15/2011' , 'MM/DD/YYYY') );
INSERT INTO TEST_DEDUPE VALUES ( 3 , 'MEMBERS MARK' , TO_DATE('01/15/2011' , 'MM/DD/YYYY') );
INSERT INTO TEST_DEDUPE VALUES ( 3 , 'MEMBERS MARK' , TO_DATE('01/15/2011' , 'MM/DD/YYYY') );
INSERT INTO TEST_DEDUPE VALUES ( 4 , 'MEMBERS MARK' , TO_DATE('01/15/2011' , 'MM/DD/YYYY') );

Now let us look at the data:

SELECT * FROM TEST_DEDUPE;

CUSTOMER_ID | BRAND        | TRANSACTION_DATE
1           | GREAT VALUE  | 20-JAN-11
1           | GREAT VALUE  | 20-JAN-11
1           | GREAT VALUE  | 20-JAN-11
1           | GREAT VALUE  | 22-JAN-11
1           | GREAT VALUE  | 22-JAN-11
2           | MEMBERS MARK | 22-JAN-11
2           | MEMBERS MARK | 22-JAN-11
2           | MEMBERS MARK | 22-JAN-11
2           | MEMBERS MARK | 15-JAN-11
2           | MEMBERS MARK | 15-JAN-11
3           | MEMBERS MARK | 15-JAN-11
3           | MEMBERS MARK | 15-JAN-11
4           | MEMBERS MARK | 15-JAN-11

 

We notice that CUSTOMER_IDs {1, 2, 3} have duplicates. For instance, CUSTOMER_ID 1 has three rows with BRAND = “GREAT VALUE” and TRANSACTION_DATE = “20-JAN-11”. After de-duping we should have only one such row.


Step 2:

Now let us rename the table that we need to de-dupe.

ALTER TABLE TEST_DEDUPE
RENAME TO TEST_DEDUPE_OLD;

Step 3:
Now let’s create a new table that has the same name as the original table.

CREATE TABLE TEST_DEDUPE AS
(SELECT CUSTOMER_ID ,
BRAND ,
TRANSACTION_DATE
FROM
(SELECT CUSTOMER_ID ,
BRAND ,
TRANSACTION_DATE ,
ROW_NUMBER()
OVER
(PARTITION BY CUSTOMER_ID , BRAND , TRANSACTION_DATE
ORDER BY TRANSACTION_DATE ) AS RN
FROM TEST_DEDUPE_OLD
)
WHERE RN = 1
);
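The effect of ROW_NUMBER() OVER (PARTITION BY ...) with WHERE RN = 1 can be sketched in plain Python (illustrative only, on a subset of the rows above): keep only the first row seen for each (CUSTOMER_ID, BRAND, TRANSACTION_DATE) key.

```python
# Duplicated rows from TEST_DEDUPE_OLD.
rows = [
    (1, "GREAT VALUE", "20-JAN-11"),
    (1, "GREAT VALUE", "20-JAN-11"),
    (1, "GREAT VALUE", "22-JAN-11"),
    (2, "MEMBERS MARK", "22-JAN-11"),
    (2, "MEMBERS MARK", "22-JAN-11"),
]

# Equivalent of RN = 1 per (CUSTOMER_ID, BRAND, TRANSACTION_DATE) partition:
# keep only the first row encountered for each key.
seen = set()
deduped = []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)

print(deduped)
```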

Final Step:
Let us view the data without dupes:

SELECT * FROM TEST_DEDUPE;
CUSTOMER_ID | BRAND        | TRANSACTION_DATE
1           | GREAT VALUE  | 20-Jan-11
1           | GREAT VALUE  | 22-Jan-11
2           | MEMBERS MARK | 15-Jan-11
2           | MEMBERS MARK | 22-Jan-11
3           | MEMBERS MARK | 15-Jan-11
4           | MEMBERS MARK | 15-Jan-11