BOB: Business Objects Board
Not endorsed by or affiliated with SAP


Things you should never do - a collection and discussion

Werner Daehn
Forum Devotee
Joined: 17 Dec 2004
Posts: 10591
Location: Germany
PostPosted: Thu Oct 28, 2010 3:47 am 
Post subject: Things you should never do - a collection and discussion

Can we collect common design mistakes in this thread? I would then document them in the Wiki; for now I am just looking for ideas.
Please post your opinions about such a thread here, and say so if you disagree with what others have said.

I have noticed that users make typical mistakes which are hard to change later. Changing a lookup from no_cache to pre_load_cache (or vice versa, depending on the case) is an example of an optimization that can be identified and changed quickly. But other design decisions would require a lot of work to correct.
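
To illustrate the kind of quick change meant here, a minimal sketch using the cache_spec argument of the lookup() function; the datastore, table and column names are invented:

Code:
# Easy to change later: only the cache_spec argument differs.
lookup(DS_SOURCE.DBO.CUSTOMER, NAME, 'none', 'NO_CACHE',       CUSTOMER_ID, QUERY.CUSTOMER_ID)
lookup(DS_SOURCE.DBO.CUSTOMER, NAME, 'none', 'PRE_LOAD_CACHE', CUSTOMER_ID, QUERY.CUSTOMER_ID)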

_________________
Werner Daehn
Product Manager Data Services
SAP AG
"The opinions expressed in this post are my own views and not those of SAP."



Werner Daehn
PostPosted: Thu Oct 28, 2010 3:53 am 
Post subject: Re: Things you should never do - a collection

Approach: Process one row (or one batch of rows) at a time.
Solution: Run the dataflow frequently, e.g. in a while loop.

Problem with that: The overhead of starting and stopping a dataflow is quite large - the dataflow has to connect to all sources and targets, the process has to be activated, etc. That might take as little as 10 seconds, but if you process one row per dataflow, the throughput is 0.1 rows/sec instead of the 50,000 rows/sec the dataflow itself could sustain.
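
To put numbers on it - a back-of-the-envelope sketch in script syntax, assuming the four variables have been declared (e.g. as local job variables) and taking a hypothetical delta of 100,000 rows:

Code:
# One dataflow execution per row vs. one dataflow for the whole delta.
$ROWS           = 100000;
$OVERHEAD_SEC   = 10;                                # start/stop cost per dataflow run
$ONE_BY_ONE_SEC = $ROWS * $OVERHEAD_SEC;             # = 1,000,000 sec, more than 11 days
$SINGLE_DF_SEC  = $OVERHEAD_SEC + ($ROWS / 50000);   # = 12 sec at 50,000 rows/sec
print('row-by-row: [$ONE_BY_ONE_SEC] sec, single dataflow: [$SINGLE_DF_SEC] sec');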


Werner Daehn
PostPosted: Thu Oct 28, 2010 4:07 am 
Post subject: Re: Things you should never do - a collection

Approach: We want full control of all the intermediate steps of the ETL flow.
Solution: We load all the data through multiple staging layers: first we copy the source changes into the replication layer, from there the data is harmonized and loaded into the quality layer, from there ........

Problem with that: Every additional layer adds a full write plus a full read of the data, and you lose parallel processing.

What is faster?
(1) Source -> Transform1 -> Loader_Stage1/Reader_Stage1 -> Transform2 -> Loader_Stage2/Reader_Stage2 -> Transform3 -> Loader_Stage3/Reader_Stage3 -> Transform4 -> Target
(2) Source -> Transform1 -> Transform2 -> Transform3 -> Transform4 -> Target

If the source is 100MB of data, the first version causes 800MB worth of I/O; the second version just 200MB - one read of 100MB plus one write of 100MB. The extra I/O is Source_Volume * 2 * number_of_layers.
The other problem: in the second version all transformations work in parallel. While Transform4 is processing row 1, Transform3 is already busy with row 2, ... and the reader is already reading row 5. (Actually there are arrays and buffers in between.) In the first version you have to wait for the first dataflow to process the entire data set, and only then can the dataflow containing Transform2 be started.


agulland
Principal Member
Joined: 17 Mar 2004
Posts: 474
Location: UK
PostPosted: Thu Oct 28, 2010 4:20 am 
Post subject: Re: Things you should never do - a collection

Good topic, and I'm sure there are loads!

The first one that springs to mind, and one I've often seen: a team starts developing a solution, dives straight into the coding without setting up any datastore configurations or system configurations, and then has a tonne of rework to do when porting the code from dev to test.

AL

_________________
AL Gulland, BI Consultant

Further BI articles and reading @ www.gulland.com/wp

Werner Daehn
PostPosted: Thu Oct 28, 2010 4:30 am 
Post subject: Re: Things you should never do - a collection

Approach: Only join one table at a time, and filter the amount of data to be joined first.
Solution: A dataflow like

Code:
Source -> Query -> Join_Query1 -> Join_Query2 -> Join_Query3 -> Target
                         ^              ^              ^
                         |              |              |
                  Filter_Query1  Filter_Query2   Filter_Query3
                         ^              ^              ^
                         |              |              |
                  LookupTable1   LookupTable2    LookupTable3



Problem with that: Often these kinds of flows cannot be executed efficiently. Depending on the details, the optimizer might be able to rewrite everything into one large SQL statement pushed down to the source database - but often not. In that case the data coming from each Filter_QueryX will be cached in memory, which is still okay. But what happens in a delta load, e.g. when the Source has just one changed row output by the object called "Query"? Then we still cache all potential lookup candidates - we may end up caching millions of rows for each lookup just to perform one single lookup.

Better to rewrite the above into
Code:
Source -> Query -> Join_Query1 -> Join_Query2 -> Join_Query3 -> Target
                         ^              ^              ^
                         |              |              |
                  LookupTable1   LookupTable2    LookupTable3


where each Join_QueryX contains the filter clause. The DataServices optimizer will still push the filter condition down into the database when reading the lookup tables, so nothing changes in that regard. But now we have the option to either cache the lookup results (for the initial load) or look up the single row in the database via a single-row select.

Even better - if applicable - would be not joining the data at all but using the lookup_ext() function as a new_function_call instead. Same result, but more control over the cache setting.
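
As an illustration only, a rough sketch of what such a lookup_ext() call mapped to a Query output column could look like; the datastore, table and column names are invented, and the exact argument list should be double-checked against the function wizard:

Code:
# Same result as the join against LookupTable1, but with an explicit cache setting:
# 'PRE_LOAD_CACHE' for the initial load, 'NO_CACHE' for a small delta load.
lookup_ext([DS_SOURCE.DBO.LOOKUPTABLE1, 'PRE_LOAD_CACHE', 'MAX'],
           [DESCRIPTION],
           ['none'],
           [ID, '=', Query.LOOKUP1_ID])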

And if everything happens to be in the same database, join all the data in the first query if possible. That should be a golden rule: join first, transform later. It is not always possible, but at least it should be the goal.


Werner Daehn
PostPosted: Thu Oct 28, 2010 4:36 am 
Post subject: Re: Things you should never do - a collection

agulland wrote:
system configurations for dev to test migration.


Actually, I am not a big fan of this approach. Mine is more difficult to set up, though; I want to get it into the product to make it easier.

The biggest downside of datastore configurations for dev-test-prod migration is that the dev system holds the settings for prod. So if I wanted to screw things up, I could load the prod system with my job. Granted, in dev the connection parameters for prod might be set to something invalid.


Werner Daehn
PostPosted: Thu Oct 28, 2010 5:19 am 
Post subject: Re: Things you should never do - a collection

Approach: DataServices is so limited when it comes to the SQL syntax supported in Queries; I do not want to read all the data from a table, I want to use variables to read just a subset.
Solution: I use a SQL Transform everywhere.

Problem with that: A SQL Transform has lots of limitations: no impact and lineage information, no partitioned readers, hand-typed SQL statements that are hard to read and modify, no full pushdown in the sense of an insert..select.., etc.

And actually it is not needed. DataServices has an optimizer which will try to rearrange the objects in a way that makes sense from a performance perspective. So if a Source table is connected to a Query and the Query contains a where clause, an outer join setting, a variable/parameter, ... the optimizer will translate all of that into the database-specific syntax and push it down to the database. The result will be exactly the same as what you would write manually - sometimes even better.
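
To make that concrete, a hypothetical illustration (SRC, LAST_CHANGED and $G_START_DATE are invented names):

Code:
# Where clause entered in the Query transform:
SRC.LAST_CHANGED >= $G_START_DATE

# Roughly the statement that gets pushed down to the source database,
# with the variable replaced by its value (check via Validation > Display Optimized SQL):
SELECT SRC.ID, SRC.NAME, SRC.LAST_CHANGED
FROM SRC
WHERE SRC.LAST_CHANGED >= <value of $G_START_DATE>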

Only for those few cases where the database has a capability that DataServices does not, or where the SQL string needs to be dynamic (e.g. different table names), does the workaround of using a SQL Transform make some sense. Even then I would first try other approaches, like using database views instead.


Nemesis
Forum Enthusiast
Joined: 08 Jun 2004
Posts: 1094
Location: London, England
PostPosted: Thu Oct 28, 2010 8:29 am 
Post subject: Re: Things you should never do - a collection

Approach: I've added my job to the Central Repository and made many updates, so some of the original dataflows are no longer used.
Solution: I delete the dataflows from the Central Repository.

Problem with that: I can't successfully retrieve historical versions of the job... and the Central Repository can get corrupted.

Real Solution: Don't delete objects from the Central Repository.

_________________
Contracting in London again.

agulland
PostPosted: Thu Oct 28, 2010 11:28 am 
Post subject: Re: Things you should never do - a collection

wdaehn wrote:
Actually, I am not a big fan of this approach...


Yes, it's not perfect, but it's better than hard coding.

A separate config file (text, XML or whatever) would be better. If I'm building some Java app, I would normally externalise connection strings and other config items in an external file - something similar in Data Services would be good!


dastocks
Principal Member
Joined: 11 Dec 2006
Posts: 166
PostPosted: Tue Nov 02, 2010 7:26 am 
Post subject: Re: Things you should never do - a collection

wdaehn wrote:
Code:
Source -> Query -> Join_Query1 -> Join_Query2 -> Join_Query3 -> Target
                         ^              ^              ^
                         |              |              |
                  LookupTable1   LookupTable2    LookupTable3


where each Join_QueryX contains the filter clause. The DataServices optimizer will still push the filter condition down into the database when reading the lookup tables, so nothing changes in that regard.

As a general point, I try to reduce the number of query transforms in my DFs to the minimum necessary. So in the above example I would try to combine all the Join_Queries into one if at all possible. This makes the DF far more maintainable: I've lost count of the number of times I've had to go through a DF where a new column from Source was required on Target, and that column had to be added through a long string of lookup joins where one or two joins would have done the same job.

Nemesis
PostPosted: Tue Nov 02, 2010 8:35 am 
Post subject: Re: Things you should never do - a collection

dastocks wrote:
As a general point, I try to reduce the number of query transforms in my DFs to the minimum necessary. ...


Have you seen the 'Propagate Column From' option that became available in version 3.2? I think it simplifies the task of adding columns to multiple transforms. I prefer to keep query transforms relatively straightforward, so I tend to have multiple simple transforms rather than a single complicated one.


eganjp
Forum Addict
Joined: 12 Sep 2007
Posts: 4280
Location: Denver Area
PostPosted: Thu Nov 04, 2010 9:30 am 
Post subject: Re: Things you should never do - a collection and discussion

Approach: A Dataflow generates a lot of rows and needs to get the data into the target table faster.
Solution: Turn on the Bulk Loader.

Problem with that: The Bulk Loader doesn't handle updates.

Real Solution:
1. Don't use the Bulk Loader; increase the number of loaders instead.
2. Split the deletes/updates and the inserts into separate branches of the Dataflow, each with its own target table. The inserts use the Bulk Loader, the deletes/updates don't. Be sure to use transaction control and set the order in which the tables are loaded so that you don't send deletes/updates at the same time the Bulk Loader is running (and likely has the table locked).
3. ????? I'm sure there are a number of other solutions, but the above two are the "simple" ones I could come up with.

_________________
Jim Egan
ProKarma - an SAP Consulting Partner

tikkanipraveen
Forum Member
Joined: 23 Sep 2008
Posts: 29
PostPosted: Fri Dec 03, 2010 2:14 am 
Post subject: Re: Things you should never do - a collection and discussion

Hi,

I have faced this problem and thought I would share it with you.

Whenever I try to join the output of two SQL transforms in a Query transform, it produces a Cartesian product.

Regards,
Praveen

Werner Daehn
PostPosted: Fri Feb 18, 2011 5:25 am 
Post subject: Re: Things you should never do - a collection and discussion

Converting date, datetime datatypes:

What format does a date datatype have, dd/mm/yyyy? dd.mm.yyyy? yyyy-dd-mm??
Answer: None. A date/datetime datatype has no format at all. Only when you use an implicit conversion from date to char (or vice versa) does a format come into play - the database session's default format, which can be different for each database and even for each client connecting to the same database.

Example: print(sysdate());
If you click on Validate All, you will see that you have an implicit conversion from datetime to varchar. As you didn't specify a to_char() yourself, the DataServices default format will be used, which is equivalent to writing
to_char(sysdate(), 'YYYY.MM.DD HH24:MI:SS')


Example: to_date(sysdate(), 'dd-mm-yyyy');
So you convert a datetime datatype (the sysdate output) to a datetime with the to_date() function??? Well, validate that as well and you will see that sysdate() gets converted to a string implicitly. So what you are actually writing is:
to_date(to_char(sysdate(), 'YYYY.MM.DD HH24:MI:SS'), 'dd-mm-yyyy');
And as the format strings do not match, you will get an error.
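
The safe pattern is therefore to make both conversions explicit - a small sketch using only the functions shown above:

Code:
# date/datetime -> string: specify the format explicitly
to_char(sysdate(), 'DD.MM.YYYY')

# string -> date: the format string must match the actual text
to_date('31.12.2011', 'DD.MM.YYYY')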


Werner Daehn
PostPosted: Tue May 31, 2011 12:11 am 
Post subject: Re: Things you should never do - a collection and discussion

Approach: I want to use a date inside the sql() function
Solution: sql('DataStore', 'DELETE FROM table WHERE date_col = [$start_date] ');

Problem with that: It is actually multi-fold.
First is the parameter substitution. In every string DataServices sees, it checks for [] or {} combinations. Any text inside these brackets is put through the expression parser and replaced by its value, so the string '1+1=[1+1]' results in the string '1+1=2'. With {}, extra quotes are additionally put around the value. Hence in the above example, the string
DELETE FROM table WHERE date_col = 2011.12.31
will be sent to the database, and the database will raise an error because 2011.12.31 is a weird number it does not understand. So the least we have to do is use {} brackets. Then the string sent to the database will be
DELETE FROM table WHERE date_col = '2011.12.31'

If we are lucky, the database recognizes this string as a date; in most cases it will say something like "I expect a date in the format DD/MM/YYYY". Either way, we do not want to rely on luck - we should define the format.

Note for the examples below: to_char() is a DataServices function; it is used inside the parameter substitution brackets {} or []. The to_date() function is part of the string sent to the database, so it is a database function. If you use a different database, check that database's syntax.

In Oracle, for example, the way to convert a string to a date is to use Oracle's to_date() function. Note: there is a to_date() function in DS as well, but that is not the one being used here!

sql('DataStore', 'DELETE FROM table WHERE date_col = to_date({$start_date}, \'YYYY.MM.DD\') ');

Now the database will receive the string
DELETE FROM table WHERE date_col = to_date('2011.12.31', 'YYYY.MM.DD')
and hopefully understand what we want.

The last bit: the {$start_date} forces DataServices to convert a date into a string. As we haven't specified any format, it uses the internal conversion rules, which always use the format YYYY.MM.DD HH24:MI:SS, and we get a conversion warning. So it is safer to call the DataServices (!! we are inside the expression !!) to_char() function explicitly.

sql('DataStore', 'DELETE FROM table WHERE date_col = to_date( { to_char($start_date, \'YYYY.MM.DD\')} , \'YYYY.MM.DD\') ');
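
A small usage note, offered as a suggestion only: since the substitution happens in any string DataServices sees (as described above), printing the same bracketed string first should show exactly what will be sent to the database:

Code:
print('DELETE FROM table WHERE date_col = to_date( { to_char($start_date, \'YYYY.MM.DD\')} , \'YYYY.MM.DD\') ');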
