Pentaho Data Integration – PDI Tutorial 1 – TXT to XML

This post is also available in German.

In this tutorial we set up our first simple Kettle transformation, which converts incoming data from a semicolon-separated text file into an XML output file. Our text file data source looks, for example, like the following record set from our movie collection:

ktxen_TextdatenInputEN

First we start the Spoon application and create a new transformation from the File menu. Then we choose the „Text file input“ step from the Input elements in the menu on the left and drag it onto the editor window.

ktx_TextFileInput

Double-clicking the step opens its configuration. On the first tab, we select the above-mentioned text file using the file dialog. On the next tab, „Content“, further adjustments can be made to the format of the text file, such as the separator character or whether header lines are present. The default settings are just right for our example. On the „Fields“ tab, all the data columns of the text file can be declared. When you press the „Get Fields“ button, Kettle does this job for you automatically by analyzing the first few lines of your data file.

ktxen_TextFileInputEN
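To get a feeling for what such an automatic field detection does, here is a minimal Python sketch. It is not Kettle's actual algorithm, only an illustration under simple assumptions: the function name `guess_fields`, the sample lines and the two type labels are made up for this example.

```python
def guess_fields(lines, delimiter=";", sample_size=100):
    """Guess name, type and length of each column from the first few lines.

    Illustrative only: this is not how Kettle implements "Get Fields".
    """
    sample = [line.rstrip("\n").split(delimiter) for line in lines[: sample_size + 1]]
    header, records = sample[0], sample[1:]
    fields = []
    for index, name in enumerate(header):
        # Collect the trimmed values of this column from the sampled records.
        values = [record[index].strip() for record in records if index < len(record)]
        # A column counts as integer only if every sampled value is a whole number.
        is_integer = bool(values) and all(v.lstrip("-").isdigit() for v in values)
        fields.append({
            "name": name.strip(),
            "type": "Integer" if is_integer else "String",
            # The length is only the maximum seen in the sample, not a safe upper bound.
            "length": max((len(v) for v in values), default=0),
        })
    return fields


# Hypothetical sample lines, invented for this sketch.
detected = guess_fields(["Year;Genre", "1999;Drama", "2004;Comedy"])
```

Because the guess is based on a sample only, a later line with a longer value or non-numeric content would not fit the detected type and length.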

The „Year“ field will at first be recognized as an integer, but I recommend changing it to a string, just like all the other fields, so that any content can be processed without running into conversion errors. For the same reason, the field length, which is set only to the maximum found in the analyzed data samples, should be extended generously. As our data is clearly separated by semicolons, this will not cause any trouble. The string trim type „both“ removes any spaces at the beginning or end of the data fields while the file is being processed.

ktxen_TextInputGetFieldsBearbeitetEN
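The effect of these settings (everything is a string, trim type „both“) can be reproduced outside of Kettle with a few lines of Python. This is only a sketch: the sample rows and the „Title“ column are invented for illustration, and `read_rows` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical sample data: the rows and the "Title" column are assumptions.
SAMPLE = (
    "Title;Year;Duration;Genre;Rack\n"
    " Example Movie A ;1999;120;Drama;A1\n"
    "Example Movie B;unknown; 95 ;Comedy;B2\n"
)


def read_rows(text, delimiter=";"):
    """Read delimited text and keep every field as a string, trimmed on both sides."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = [name.strip() for name in next(reader)]
    return [
        dict(zip(header, (value.strip() for value in record)))
        for record in reader
        if record  # skip empty lines
    ]


rows = read_rows(SAMPLE)
```

Since no field is converted to a number, a value such as „unknown“ in the Year column is simply passed through instead of causing a conversion error.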

That is it for the data input. For the output part of our transformation, we choose an „XML output“ step from the designer’s Output section, drag it onto the editor window and connect it to the text input by using the middle mouse button or the step’s context menu.

ktx_XMLOutputDazu

Again we open the configuration by double-clicking the output step. On the first tab, the output file name is specified, again using the standard file dialog. On the „Content“ tab, we can adjust the names of the XML root node and of the nodes below it to fit our content. In this example I set them to „Movies“ and „Movie“.

ktxen_XMLOutputKonfigEN

On the „Fields“ tab, all the data fields can be filled in automatically by pressing the „Get Fields“ button again, as they are already defined in our data stream via the step connection. The field names correspond to the output fields of the „Text file input“ step; you can check this by right-clicking the step. For each data field we have to define whether it is written out as an XML element or as an XML attribute.

ktxen_XMKonfigurieren2EN
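The difference between an XML element and an XML attribute can be sketched in Python with the standard library module `xml.etree.ElementTree`. This is not what Kettle runs internally, just an illustration: `rows_to_xml` is a hypothetical helper and the sample row is invented.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, attribute_fields, root_name="Movies", row_name="Movie"):
    """Build an XML tree in which each row becomes one node below the root."""
    root = ET.Element(root_name)
    for row in rows:
        node = ET.SubElement(root, row_name)
        for name, value in row.items():
            if name in attribute_fields:
                # Written as an attribute of the row node.
                node.set(name, value)
            else:
                # Written as a child element of the row node.
                ET.SubElement(node, name).text = value
    return root


# Hypothetical sample row, invented for this sketch.
rows = [{"Year": "1999", "Duration": "120", "Genre": "Drama", "Rack": "A1"}]
root = rows_to_xml(rows, attribute_fields={"Year", "Duration"})
xml_text = ET.tostring(root, encoding="unicode")
```

Here „Year“ and „Duration“ end up inside the opening „Movie“ tag, while „Genre“ and „Rack“ become nodes of their own below it.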

And we are ready! Back in the editor window, we click the „Play“ button at the top left to start the transformation and then click „Launch“ in the configuration dialog that follows. Before starting, Kettle asks us to save our new transformation if we have not already done so. After the transformation has finished, all steps should be marked with a green check mark to show that no errors occurred. The „Execution Results“ tab below the editor indicates the number of processed rows for each step. In case of an error, the affected step is highlighted in red and more details are written to the „Logging“ tab for troubleshooting.

ktx_TrafoExecutionResults

Finally, we open the output file that we defined in the XML output step and take a closer look at the result. As expected, the file contains one „Movies“ XML root element, which contains all four „Movie“ entries as child nodes. The „Duration“ and „Year“ fields are attributes of each „Movie“ element, whereas the „Genre“ and „Rack“ data fields are separate child nodes.

ktxen_outputEN
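If you prefer to check the result with a script instead of by eye, a small Python sketch like the following could do it. The XML text used here is a made-up stand-in with only two entries, not the real output of the transformation, and `summarize_movies` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def summarize_movies(xml_text):
    """Return the root tag, the number of Movie nodes and the layout of the first one."""
    root = ET.fromstring(xml_text)
    movies = root.findall("Movie")
    first = movies[0] if movies else None
    return {
        "root": root.tag,
        "count": len(movies),
        "attributes": sorted(first.attrib) if first is not None else [],
        "children": [child.tag for child in first] if first is not None else [],
    }


# Invented stand-in for the output file; with a real file you could read its text first.
SAMPLE_OUTPUT = """<Movies>
  <Movie Duration="120" Year="1999"><Genre>Drama</Genre><Rack>A1</Rack></Movie>
  <Movie Duration="95" Year="2004"><Genre>Comedy</Genre><Rack>B2</Rack></Movie>
</Movies>"""

summary = summarize_movies(SAMPLE_OUTPUT)
```

For the file from this tutorial, the count should match the number of rows shown on the „Execution Results“ tab.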
