|
Extracting Tabular Data from Text Documents
Extracting row and column data from text documents is an important
requirement in the IT departments of many enterprises. This
requirement arises from the need to be able to import data written by
disparate applications into enterprise databases. A simple format such
as a tab-delimited format or CSV (comma-delimited) format can be
handled by applications such as Microsoft Excel as well as
applications like SQL Loaders which are a part of many
databases. However, when text documents require extensive data
massaging for a successful import, these tools are usually found
lacking. In these situations, enterprises are faced with the situation
of requiring programming in a low-level language using expensive IT
resources. Ferrite, with its extensive data massaging features, is
uniquely positioned to solve these types of problems with a minimum of
effort. This article attempts to illustrate some simple processing for
converting a text document into data usable by another application.
Sample Data
Assume the following is a text document produced by another
application, and we would like to convert this data into row-column
data so it can be exported as a CSV file. In the following display,
line numbers have been added for the purpose of illustration for this
article.
Some observations about this data
- The file is tab-delimited. However, some numeric columns
are enclosed in quotes, and have embedded commas(,), dollar
characters($) and whitespace.
- Lines 4, 9, 14 and 17 appear to be empty, but are not. There are
some space and tab characters on these lines.
1|Sample Ferrite Data
2|Month Q1 Q2 Q3 Q4 Total
3|Seasonality 0.9 1.1 0.8 1.2
4|
5|Units Sold "3,592 " "4,390 " "3,192 " "4,789 " "15,962 "
6|Sales Revenue "$143,662 " "$175,587 " "$127,700 " "$191,549 " "$638,498 "
7|Cost of Sales "89,789 " "109,742 " "79,812 " "119,718 " "399,061 "
8|Gross Margin "53,873 " "65,845 " "47,887 " "71,831 " "239,437 "
9|
10|Salesforce "8,000 " "8,000 " "9,000 " "9,000 " "34,000 "
11|Advertising "10,000 " "10,000 " "10,000 " "10,000 " "40,000 "
12|Corp Overhead "21,549 " "26,338 " "19,155 " "28,732 " "95,775 "
13|Total Costs "39,549 " "44,338 " "38,155 " "47,732 " "169,775 "
14|
15|Prod. Profit "$14,324 " "$21,507 " "$9,732 " "$24,099 " "$69,662 "
16|Profit Margin 10% 12% 8% 13% 11%
17|
18|Product Price $40.00
19|Product Cost $25.00
Creating a Workflow
- Create
a new workflow (also called a pipeline) by selecting Create -> File Processor from the main
menu. Select a project where the pipeline is to be created and click
Next. Enter a name for the new pipeline,
select a suitable working directory, or accept the default. For now,
leave the View Output option at As Plain Text and click Next.
- Select
the file to be processed by clicking Add
Files and navigating to the file path. Specify the file
encoding if required, or leave it at the Use
Platform Default option.
- At this point, double clicking on a file name in the files list
box brings up the file
viewer. Note that if the document encoding is specified, it is
used to open the file for viewing.
- At this point, clicking Finish opens
the pipeline in the Pipeline Editor. The menu bar and toolbar are also
updated with relevant items. Clicking the Run button executes the pipeline, but since we
have not yet added any filters, the
output appears unchanged in the console viewer.
Using the Record Editor to cleanup text
- On observing the output, we discover that we would like to delete
the first line, so let us add a filter for doing so. Select
Advanced -> Record Editor from the main menu. In the
Record Editor main window, the buttons on the left side are used to
add processing rules. Click the Delete button to add a
delete command. Select Record Numbered ... in the address
window, and enter
1 in the text box and click
Finish
- The rule we added appears
in the Record Editor Main Window. Ensure that Enable Default Output is checked. This ensures
that lines which are not selected by any processing rule appears in
the output.
- Click the Run button again to see
the resultant output. Note that the first line has been deleted
from the output.
- Let us now add a rule to delete blank lines in the output. On viewing
the input file in the text file viewer, we notice that the blank
lines actually contain whitespace as indicated by the blue
highlighting.
- Click the Delete button on the left side of the
Record Editor main window again to add a delete command. Select
the Record(s) Matching Regex ... option and click
Set. In the Regex
Window, enter the pattern for matching blank lines containing
optional whitespaces:
^\s*$.
- Click Finish to complete the setup. The rule we just
added appears in the record
editor main window.
- Executing the pipeline results in the output shown. All
blank lines, with or without whitespaces, have been deleted.
Extract rows and columns from text data
- Let us now use the Field Processor filter to convert the cleaned
up text into rows and columns of data. Select Advanced ->
Field Processor from the main menu. In the field processor main
window, click Add to add a processing rule. Select
All Records in the record selection window, and click
Next.
- Add the following code to print
all fields from the input channel to the output channel.
print();
Click Check for Errors to check for syntax errors in the
entered statement, and click Finish.
- Once the rule is added, the field processor main window shows
the rule.
- At this point, the input data is plain text and has not yet been
converted into rows and column. We need to specify the field delimiter
to split each input record into fields. Click Next in the
field processor main window to specify the field splitter, and select
Use Regular Expression under Field
Separator. Click Set Regular Expression under
Regular Expression Options and specify
the regular expression. Also select
the option to discard the separator since we do not want the
separator to be a part of the field.
- Running the pipeline results in the output shown. Since
we are still viewing the output in the plain text viewer, the rows and
columns in the data are not evident. This is because viewing field
data as plain text results in concatenation of all the fields
together. To view the data in a table, we need to use the Table
Viewer.
Table Viewer
- To specify the use of the Table Viewer to view the output, we
need to change the default terminator for the pipeline. Right click on
the pipeline in the Project Browser, and select
Properties from the context menu.
- In the pipeline
properties window, set the output viewer to view in tabular
format.
- Execute the pipeline to view
the output as tabular data.
Delete characters from field values
- At this point, we have successfully processed text data to
extract record and field data. However, some of the fields contain
characters that we would like to get rid of such as:
comma(,), dollar($), space and double
quotes("). Let us modify the pipeline to eliminate
these as well.
- To modify the field values, we can make changes to the Field
Processor configuration and modify the field values directly before
writing it to the output channel. Double click on the Field Processor
filter within the workflow editor, or select Properties
from the context menu (on right mouse button click).
- In order to delete unwanted characters from field values, we can
use a Javascript regular expression. Let us define a
variable storing the regex inside the BEGIN block of the Field
Processor. Click Add from the Field Processor main
window, and select the option to add a processing statement at the
beginning before processing any records.
- Click Next and define a Javascript variable using
the statement:
var regex = /[$",\s]/g;. This
statement defines a Javascript regular expression to select
comma(,), dollar($), whitespace and double
quotes("). The g flag at the end of the
regular expression selects all instances of these characters within
the input string. For more details, consult any Javascript Reference.
- Let us now modify the earlier rule we used to print all the
fields. Select the rule within the Field Processor main window and
click Modify, or double click on the rule.
- Leave the record selection page of the wizard at All
records and click Next.
- Change the text of the rule to read as shown below. This code
deletes the characters matched by the regex for all the fields
starting with the second field. The final print()
statement echoes the field values as before. Click Finish
to accept the changes.
for (var i = 2; i < NF; i++) {
input[i] = input[i].replace(regex, "");
}
print();
- Click Finish once more to accept the changes to
Field Processor configuration, and execute the pipeline. We now see
that all undesirable characters in the fields we specified have been
removed as desired.
Save Output as CSV File
- Now that we have the output in the desired format, let us save
the output as a CSV file. Select Output -> Save in CSV
Format from the main menu.
- Enter the name of the file to save output to, select a backup
extension, and select the output encoding if desired. Click
Next to set the CSV Output options.
- In this window, select the field delimiter and click
Finish to add the filter.
- Execute the pipeline. The output file is written, backing up
the file if it exists. The final output is shown below.
Month,Q1,Q2,Q3,Q4,Total
Seasonality,0.9,1.1,0.8,1.2,
Units Sold,3592,4390,3192,4789,15962
Sales Revenue,143662,175587,127700,191549,638498
Cost of Sales,89789,109742,79812,119718,399061
Gross Margin,53873,65845,47887,71831,239437
Salesforce,8000,8000,9000,9000,34000
Advertising,10000,10000,10000,10000,40000
Corp Overhead,21549,26338,19155,28732,95775
Total Costs,39549,44338,38155,47732,169775
Prod. Profit,14324,21507,9732,24099,69662
Profit Margin,10%,12%,8%,13%,11%
Product Price,40.00
Product Cost,25.00
Summary
- We have processed a text document to perform some simple text
manipulation, and converted the data to CSV output.
- In doing so, we created a workflow capturing the actions required
for this purpose. This means that this workflow definition can be
stored and re-executed with a single click, thus lending itself to
automation.
- To apply the same transformation to different file(s), we just
need to change the input file(s), and possibly the output file(s).
This article has briefly touched upon the capabilities of Ferrite for
use in automating general purpose text manipulation. We hope this has
excited your interest sufficiently to download Ferrite today and try
it for yourself.
|
Ferrite Platform version 1.5 |
| US$49 |
| 90-day money back guarantee |
| Free Upgrades for 1 year! |
 |
|
Ferrite Platform version 1.2 |
| Free Download |
| Try it risk-free for 14 days |
 |
|
|