Extracting Tabular Data from Text Documents
Extracting row and column data from text documents is an important requirement in the IT departments of many enterprises. This requirement arises from the need to be able to import data written by disparate applications into enterprise databases. A simple format such as a tab-delimited format or CSV (comma-delimited) format can be handled by applications such as Microsoft Excel as well as applications like SQL Loaders which are a part of many databases. However, when text documents require extensive data massaging for a successful import, these tools are usually found lacking. In these situations, enterprises are faced with the situation of requiring programming in a low-level language using expensive IT resources. Ferrite, with its extensive data massaging features, is uniquely positioned to solve these types of problems with a minimum of effort. This article attempts to illustrate some simple processing for converting a text document into data usable by another application.
Sample Data
Assume the following is a text document produced by another application, and we would like to convert this data into row-column data so it can be exported as a CSV file. In the following display, line numbers have been added for the purpose of illustration for this article. Some observations about this data
  • The file is tab-delimited. However, some numeric columns are enclosed in quotes, and have embedded commas(,), dollar characters($) and whitespace.
  • Lines 4, 9, 14 and 17 appear to be empty, but are not. There are some space and tab characters on these lines.
 1|Sample Ferrite Data					
 2|Month	Q1	Q2	Q3	Q4	Total
 3|Seasonality	0.9	1.1	0.8	1.2	
 4|					
 5|Units Sold	"3,592 "	"4,390 "	"3,192 "	"4,789 "	"15,962 "
 6|Sales Revenue	"$143,662 "	"$175,587 "	"$127,700 "	"$191,549 "	"$638,498 "
 7|Cost of Sales	"89,789 "	"109,742 "	"79,812 "	"119,718 "	"399,061 "
 8|Gross Margin	"53,873 "	"65,845 "	"47,887 "	"71,831 "	"239,437 "
 9|					
10|Salesforce	"8,000 "	"8,000 "	"9,000 "	"9,000 "	"34,000 "
11|Advertising	"10,000 "	"10,000 "	"10,000 "	"10,000 "	"40,000 "
12|Corp Overhead	"21,549 "	"26,338 "	"19,155 "	"28,732 "	"95,775 "
13|Total Costs	"39,549 "	"44,338 "	"38,155 "	"47,732 "	"169,775 "
14|					
15|Prod. Profit	"$14,324 "	"$21,507 "	"$9,732 "	"$24,099 "	"$69,662 "
16|Profit Margin	10%	12%	8%	13%	11%
17|	
18|Product Price	$40.00 
19|Product Cost	$25.00 
Creating a Workflow
  1. Create a new workflow (also called a pipeline) by selecting Create -> File Processor from the main menu. Select a project where the pipeline is to be created and click Next. Enter a name for the new pipeline, select a suitable working directory, or accept the default. For now, leave the View Output option at As Plain Text and click Next.
  2. Select the file to be processed by clicking Add Files and navigating to the file path. Specify the file encoding if required, or leave it at the Use Platform Default option.
  3. At this point, double clicking on a file name in the files list box brings up the file viewer. Note that if the document encoding is specified, it is used to open the file for viewing.
  4. At this point, clicking Finish opens the pipeline in the Pipeline Editor. The menu bar and toolbar are also updated with relevant items. Clicking the Run button executes the pipeline, but since we have not yet added any filters, the output appears unchanged in the console viewer.
Using the Record Editor to cleanup text
  1. On observing the output, we discover that we would like to delete the first line, so let us add a filter for doing so. Select Advanced -> Record Editor from the main menu. In the Record Editor main window, the buttons on the left side are used to add processing rules. Click the Delete button to add a delete command. Select Record Numbered ... in the address window, and enter 1 in the text box and click Finish
  2. The rule we added appears in the Record Editor Main Window. Ensure that Enable Default Output is checked. This ensures that lines which are not selected by any processing rule appears in the output.
  3. Click the Run button again to see the resultant output. Note that the first line has been deleted from the output.
  4. Let us now add a rule to delete blank lines in the output. On viewing the input file in the text file viewer, we notice that the blank lines actually contain whitespace as indicated by the blue highlighting.
  5. Click the Delete button on the left side of the Record Editor main window again to add a delete command. Select the Record(s) Matching Regex ... option and click Set. In the Regex Window, enter the pattern for matching blank lines containing optional whitespaces: ^\s*$.
  6. Click Finish to complete the setup. The rule we just added appears in the record editor main window.
  7. Executing the pipeline results in the output shown. All blank lines, with or without whitespaces, have been deleted.
Extract rows and columns from text data
  1. Let us now use the Field Processor filter to convert the cleaned up text into rows and columns of data. Select Advanced -> Field Processor from the main menu. In the field processor main window, click Add to add a processing rule. Select All Records in the record selection window, and click Next.
  2. Add the following code to print all fields from the input channel to the output channel.
    print();
    
    Click Check for Errors to check for syntax errors in the entered statement, and click Finish.
  3. Once the rule is added, the field processor main window shows the rule.
  4. At this point, the input data is plain text and has not yet been converted into rows and column. We need to specify the field delimiter to split each input record into fields. Click Next in the field processor main window to specify the field splitter, and select Use Regular Expression under Field Separator. Click Set Regular Expression under Regular Expression Options and specify the regular expression. Also select the option to discard the separator since we do not want the separator to be a part of the field.
  5. Running the pipeline results in the output shown. Since we are still viewing the output in the plain text viewer, the rows and columns in the data are not evident. This is because viewing field data as plain text results in concatenation of all the fields together. To view the data in a table, we need to use the Table Viewer.
Table Viewer
  1. To specify the use of the Table Viewer to view the output, we need to change the default terminator for the pipeline. Right click on the pipeline in the Project Browser, and select Properties from the context menu.
  2. In the pipeline properties window, set the output viewer to view in tabular format.
  3. Execute the pipeline to view the output as tabular data.
Delete characters from field values
  1. At this point, we have successfully processed text data to extract record and field data. However, some of the fields contain characters that we would like to get rid of such as: comma(,), dollar($), space and double quotes("). Let us modify the pipeline to eliminate these as well.
  2. To modify the field values, we can make changes to the Field Processor configuration and modify the field values directly before writing it to the output channel. Double click on the Field Processor filter within the workflow editor, or select Properties from the context menu (on right mouse button click).
  3. In order to delete unwanted characters from field values, we can use a Javascript regular expression. Let us define a variable storing the regex inside the BEGIN block of the Field Processor. Click Add from the Field Processor main window, and select the option to add a processing statement at the beginning before processing any records.
  4. Click Next and define a Javascript variable using the statement: var regex = /[$",\s]/g;. This statement defines a Javascript regular expression to select comma(,), dollar($), whitespace and double quotes("). The g flag at the end of the regular expression selects all instances of these characters within the input string. For more details, consult any Javascript Reference.
  5. Let us now modify the earlier rule we used to print all the fields. Select the rule within the Field Processor main window and click Modify, or double click on the rule.
  6. Leave the record selection page of the wizard at All records and click Next.
  7. Change the text of the rule to read as shown below. This code deletes the characters matched by the regex for all the fields starting with the second field. The final print() statement echoes the field values as before. Click Finish to accept the changes.
    for (var i = 2; i < NF; i++) {
        input[i] = input[i].replace(regex, "");
    }
    print();
    
  8. Click Finish once more to accept the changes to Field Processor configuration, and execute the pipeline. We now see that all undesirable characters in the fields we specified have been removed as desired.
Save Output as CSV File
  1. Now that we have the output in the desired format, let us save the output as a CSV file. Select Output -> Save in CSV Format from the main menu.
  2. Enter the name of the file to save output to, select a backup extension, and select the output encoding if desired. Click Next to set the CSV Output options.
  3. In this window, select the field delimiter and click Finish to add the filter.
  4. Execute the pipeline. The output file is written, backing up the file if it exists. The final output is shown below.
    Month,Q1,Q2,Q3,Q4,Total
    Seasonality,0.9,1.1,0.8,1.2,
    Units Sold,3592,4390,3192,4789,15962
    Sales Revenue,143662,175587,127700,191549,638498
    Cost of Sales,89789,109742,79812,119718,399061
    Gross Margin,53873,65845,47887,71831,239437
    Salesforce,8000,8000,9000,9000,34000
    Advertising,10000,10000,10000,10000,40000
    Corp Overhead,21549,26338,19155,28732,95775
    Total Costs,39549,44338,38155,47732,169775
    Prod. Profit,14324,21507,9732,24099,69662
    Profit Margin,10%,12%,8%,13%,11%
    Product Price,40.00
    Product Cost,25.00
    
Summary
  • We have processed a text document to perform some simple text manipulation, and converted the data to CSV output.
  • In doing so, we created a workflow capturing the actions required for this purpose. This means that this workflow definition can be stored and re-executed with a single click, thus lending itself to automation.
  • To apply the same transformation to different file(s), we just need to change the input file(s), and possibly the output file(s).
This article has briefly touched upon the capabilities of Ferrite for use in automating general purpose text manipulation. We hope this has excited your interest sufficiently to download Ferrite today and try it for yourself.








































Ferrite Platform
version 1.5
US$49
90-day money back guarantee
Free Upgrades for 1 year!

Ferrite Platform
version 1.2
Free Download
Try it risk-free for 14 days