28
Oct

Slowly Changing Dimension

In this post, we will discover Slowly Changing Dimension in Talend Open Studio.Talend SCD components

First, What is SCD?

  • SCD Type 1 overwrites original data in SCD table with new data. There is no tracking of historical changes (old data cannot be recovered). It allows to correct some kinds of error, for example misspellings in names.
  • SCD Type 2 tracks historical changes: if new data is different from old data, a new record is added in the SCD table, instead of erasing the old one. The old record becomes inactive, and the new record active.

Now, an example

For this example, we use a simple table client.

id name city

The SCD table client_scd we use to track modifications of clients over time is:

tk id name city start_date end_date
           
tk Unique identifier of records in SCD table (technical key).
id Source key, identifier in source table (that is, field id in client table in our example).
name Name of clients (matching field name in client table). We use SCD Type 1 to manage modifications of this column.
city City where clients live (matching field city in client table). We use SCD Type 2 to manage modifications of this column.
start_date Date of insertion of new records for SCD Type 2 modifications.
end_date End of validity of records. A record becomes invalid when a new record with the same source key is inserted in SCD table. The last inserted record for a given source key has a NULL value as end date.

Note that columns start_date and end_date are not needed if SCD type 2 is not used.

The Use case

We insert two records in client table, with a mispelling in Robert’s name.

id name city
1 Rober San Francisco
2 Bill LA

Now, we update client_scd table (let’s say we are July 1st). Since field id of each record in client table has no matching id (source key) in client_scd table, each record of client table is inserted in client_scd. start_date receive current date (ie when the records were inserted in SCD table).

tk id name city start_date end_date
1 1 Rober San Francisco 2007-07-01 NULL
2 2 Bill LA 2007-07-01 NULL

Let’s imagine Robert moves to LA. The updated client table would look like this:

id name city
1 Rober LA
2 Bill LA

When SCD table is updated (we are July 5th), there is already an existing source key in client_scd matching id of each record in client table. Bill has not been modified, so his record is not updated in SCD table. Robert has moved to LA, and city column is managed with SCD Type 2, so a new record will be inserted in SCD table. start_date of the new record will receive current date, so will end_date.

tk id name city start_date end_date
1 1 Rober San Francisco 2007-07-01 2007-07-05
2 2 Bill LA 2007-07-01 NULL
3 1 Rober LA 2007-07-05 NULL

Now, let’s imagine the mispelling on Robert’s name is corrected. The client table looks like:

id name city
1 Robert LA
2 Bill LA

When SCD table is updated (July 10th), Robert’s record has changed. Because name is managed as SCD Type 1, all records matching source key of Robert will be updated, without tracking modifications.

tk id name city start_date end_date
1 1 Robert San Francisco 2007-07-01 2007-07-05
2 2 Bill LA 2007-07-01 NULL
3 1 Robert LA 2007-07-05 NULL

Talend SCD components

Now, let’s see how SCD rules are implemented in tOracleSCD / tMysqlSCD / tDB2SCD / tIngresSCD / tMSSQLSCD / tSybaseSCD components.

SCD Component - General settings

Standard database connection parameters
FIG1
Figure 1: ToracleSCD properties, part 1
Parameters
Table Name of SCD table.
Schema type Schema of SCD table.
Source keys Key fields in source table.

SCD Component - SCD Type 1

SCD Figure 2

Figure 2: ToracleSCD properties, part 2

If you want to use SCD type 1, check the box, and add input fields which should be managed with SCD type 1.

SCD Component - SCD Type 2

FIG3
Figure 3: ToracleSCD properties, part 3

If you want to use SCD type 2, check the box, and add input fields which should be managed with SCD type 2. You must set start and end dates fields (used when new records are inserted in SCD table).

Additionnaly, SCD component provides two options for SCD type 2 rules.

  • First is active status: when this option is checked, you need to set an active field, which will contain 1 in tOracleSCD (’true’ in tMysqlSCD) for all records whose end date is NULL, 0 (’false’) for other records.

scd_client table of our last example would look like:

tk id name city start_date end_date active
1 1 Robert San Francisco 2007-07-01 2007-07-05 0
2 2 Bill LA 2007-07-01 NULL 1
3 1 Robert LA 2007-07-05 NULL 1
  • Second option is version management: when this option is checked, you need to set a version field. Each time a new record is inserted in SCD table, either it is the first record of a given source keys set and version field will receive value 1, or there is already a record with source keys matching keys of the new record, and version will be incremented by 1 for the new record.

scd_client table of our last example would look like:

tk id name city start_date end_date version
1 1 Robert San Francisco 2007-07-01 2007-07-05 1
2 2 Bill LA 2007-07-01 NULL 1
3 1 Robert LA 2007-07-05 NULL 2

Job example

Here is a job example (use case we have seen before) : we read records from a source table, and update a SCD table with these records.

Others features
Talend SCD component provides others interesting features like SurogateKey generation (thanks a talendRoutine, a sequence, an autocrement field, an input field or a table Max +1) and SCD type 3 fields (save the old value on a specific field).

Enjoy SCD component!

27
Oct

Talend Open Studio 2.2.1 is out

Dear Talend user,

3 weeks after the General Availability of Talend Open Studio 2.2.0, a new release called TOS 2.2.1 is available at http://www.talend.com/download.php.
What are the change ? No, new feature but 35 bugs was fixes (see the list below).
Enjoy the release!
Cedric

- 0002337: [metadata wizards] LDAP Wizard password are mandatory (mhirt)
- 0002276: [Java components] [tFileOutputDelimited] Add an option to write data row by row and not block by block (mhirt)
- 0002298: [Java components] tAdvancedFileOutputXML lost its configuration Tree (mhirt)
- 0002340: [job designer editor] Losing ‘Encoding Type’ in a copy/paste operation. (nrousseau)
- 0002352: [properties] [Encoding] Impossible to edit the Custom encoding. (nrousseau)
- 0002320: [Java components] can’t put a tsortrow after a tdenormalize (mhirt)
- 0002334: [Java components] [tSendMail] SMTP port and context variable (mhirt)
- 0002194: [schema management] Export schema: few remarks (smallet)
- 0002332: [run job] TalendException constructor is undefined on a tCreateTable job in a muli thread execution (mhirt)
- 0002295: [Java components] “Rejects” Link from DBOutput (MysqlOutput) is always fired (mhirt)
- 0002251: [Java components] [tFirebirdInput/tFirebirdOutput] database property not filled with repository connection (mhirt)
- 0002301: [Java components] taggregateRow generate an exception with (integer, Short and Long) (mhirt)
- 0002326: [Perl components] Bug in tPostgresqlOutput : UPDATE doesn’t work (plegall)
- 0002240: [Java components] Schema not present in Guess Query for tOracleInput, and in INSERT INTO expression for tOracleOutput (nrousseau)
- 0002314: [schema management] No Default column in the schema dialog in Perl. (nrousseau)
- 0002287: [Java components] [tAggregateRow] Pb when renaming columns (nrousseau)
- 0002317: [DB Mappings] MySQL “datetime” mapped to “date” Perl Talend type instead of “datetime” (plegall)
- 0002291: [Java components] Java Oracle Components and wrong protecting char (”) (smallet)
- 0002188: [import/export job/project] Bad context for a child after export (smallet)
- 0002232: [properties] [tCreateTable] Component has a schema setted when using from scratch (smallet)
- 0002231: [Java components] JavaDemo tOracleSPInput not working (mhirt)
- 0002228: [Java components] tAS400Output Component Creates Inccorect SQL Statements (mhirt)
- 0002267: [sqlBuilder] In the db components, the option “Use an existing connection” is not used by the sqlBuilder. (nrousseau)
- 0002292: [general] [Preferences] Perl installation should be optional in the preferences in Java Mode. (nrousseau)
- 0002246: [graphical User Interface] Bad count of columns when you used a database function (nrousseau)
- 0002233: [Java components] [tContextLoad] the “Disable warnings” checkbox doen’t work (mhirt)
- 0002277: [context] [Context View] It should be impossible to edit the contexts in “Repository Mode”. (nrousseau)
- 0002253: [code generation] implicit tFlowMeter breaks tMap (xtan)
- 0002212: [Java components] tRowGenerator forgets values (nrousseau)
- 0002229: [Java components] there is a bugin tRSSnput in 2.2.0. (mhirt)
- 0002249: [Java components] tFileTouch must not delete file if exist only update modification time (mhirt)
- 0002209: [Java components] Header missing when you check IncludeHeader and Append mode (mhirt)
- 0002241: [metadata wizards] file wizards guesses “String” type instead of “string” (plegall)
- 0002225: [context] conflict when changing job while focus is on context view 2 (smallet)
- 0001947: [Java components] “deduplicate” option in tNormalize (cshan)

27
Sep

TOS 2.2.0 RC1 : We need you help for the newest Talend !

Dear Talend Community Member,

We are proud to inform you that Talend Open Studio 2.2.0 Release Candidate is now available. This version contains all the features of Talend Open Studio 2.2.0 and we need you to track all problems that might exist in your Open Source data integration tool, before its release.

What’s new in this version?

The numerous new features of Talend Open Studio 2.2.0 include:

  • enhancement of the management of contexts (GUI, new tContextDump component)
  • export jobs as Java Web Services
  • graphical expression builder

Talend Open Studio is now based on the latest Eclipse version (3.3), you can benefit from all the improvements of this new framework (including support for Windows Vista).

We have also integrated new components:

Java :

  • Support for more databases: AS/400 connector, generic JDBC connector
  • Slowly Changing Dimensions for MySQL, Oracle, Ingres, MS SQL, DB2, Sybase (support for types 1, 2 & 3, support for Surrogate Keys, etc.)
  • Support for stored procedures in Oracle, MS SQL, Ingres, MySQL, DB2
  • Connection sharing for Oracle and PostgreSQL
  • Support for LDIF/LDAP
  • “Wait for file” and “Wait for SQL Data” to start a job upon the apparition of a file or of certain records in a table
  • Flow merge and split (tUnite and tReplicate)
  • Support for SCP

Perl :

  • Multiple substitutions, simple and complex (tReplace)
  • Connection sharing for Oracle and PostgreSQL
  • Lookup with multiple matches
  • “Wait for file” and “Wait for SQL Data” to start a job upon the apparition of a file or of certain records in a table
  • Flow data metering
  • File touch
  • Flow merge and split (tUnite and tReplicate)
  • Support for SCP

Performance of complex jobs have been significantly improved with the passing of data structures as references. Check out this scenario to feel the performance enhancement: http://www.talendforge.org/wiki/doku.php?id=performances:scenario_3.

Please download and test this Release Candidate, read the documentation, go through the tutorials, chat with us on the Forum, suggest new features and report bugs on our Bugtracker, check out our technical documentations on the Wiki…

Joining the Talend community is the best way to influence the progress of your preferred data integration solution!

The download is available at www.talend.com/download.php (http://www.talend.com/download.php).
The community tools (Forum, Bugtracker, Changed Log, Wiki, Subversion, Trac, Flash tutorials…) are available at www.talendforge.org (http://www.talendforge.org).

Thanks again for your support and your involvement!
Best regards,

The Talend Team

06
Mar

Overview about Open Blue Lab

Accompanied by the OSEO anvar and the Regional Council of “Pays de la Loire”, BlueXML installs a technical platform to develop software bricks leading to the constitution of a CRM & open ERP source full Web, Openbluelab.

This Open source project provides lots of functional modules : Intranet/Extranet, CRM, Pay management, Portal, …

Technically they use Cocoon for Xml application server, and UML editors plugins from Eclipse community.

Openbluelab gives to users possibility to develop in a UML view his ERP model. The user creates his own Use case diagram, he can modify Classes diagram.
Then Openbluelab generates web portals applications from this UML model. It makes it possible to massively generate the information systems and ERP of the company.
Propose an guided approach by the models allowing a powerful and simple personalization of the whole of the elements and process of the application. To date 40 generic models are available.

This product is an accelerator of performances :

Industrialize the data-processing developments since the phase of design until the phase of deployment.The profit in times of development can go up to 70%.
BlueXML is studying about Talend Open Studio and SPAGO BI, to integrate these powerful products for integration and restitution in all projects Openbluelab.

Talend Open Studio would make it possible to migrate the data of the old ERP towards the new ones.

Regards.

Christophe ANTOINE





Downloads now!

Administration

  • Login
  • July 2008
    M T W T F S S
    « Oct    
     123456
    78910111213
    14151617181920
    21222324252627
    28293031  

    Forum

    Want to discuss Talend Open Studio? Having problems installing the software? Need help using the software? Want to make suggestions and discuss them with development team? The Talend Forum is the right place!

    Latest Comments

    Vote for our product

    Eclipse
    Vote for our product in
    eclipseplugincentral.com!
    Open Source
    Copyright © 2007 - 2008 Talend. All rights reserved. Talend Contributor Agreement