Data Extraction Example -- Processing CVS logs
Data extraction from text files is a common task in IT departments,
whether it is for generating reports, loading into a database (ETL:
Extraction, Transformation, Loading), processing log files, or for a
variety of other purposes. In this article, we examine how to use
Ferrite to extract some useful information from a CVS Log. Important
points highlighted in this article include:
- Records spanning more than one line can be selected by specifying
a suitable record delimiter.
- Only those records that match a regular expression are selected
for output, while the rest are suppressed.
- The pattern we are searching for spans more than one line within
the record.
Introduction
CVS stands for Concurrent Versions System and is a tool widely used for
version management of files. The log extracted from a typical CVS
system is shown here.
Each log record is delimited by a line of equals characters
(=) as shown. Within each log record, one or more log
messages can be present, each delimited by a line of hyphen
characters (-). We would like to extract only those log
records that contain more than one log message, since a log record with
a single log message indicates that the corresponding
version-controlled file has not been changed since the previous
release. In this picture, a log record with a single log message has
been indicated.
Specify the Record Delimiter
We begin by creating a workflow (called cvsLog) and
selecting the CVS log file to process. In the File Selection dialog
box, click Next to specify
the record delimiter. Select the radio button Specify Regular Expression delimiting records
and enter the pattern ^=+\n$. This pattern specifies that
a line containing only equals characters (=) is the record
delimiter. Click Finish.
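For readers who want to see the effect of this delimiter outside the tool, here is a minimal Python sketch. It assumes the delimiter pattern is tested against each line in turn, which is a guess about how Ferrite applies it and not documented behavior. The function name split_records and the sample text are made up for illustration.

```python
import re

# Delimiter pattern from the article: a line made up only of '=' characters.
DELIMITER = re.compile(r"^=+\n$")


def split_records(text):
    """Split text into records, treating each all-'=' line as a delimiter."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        if DELIMITER.match(line):
            # Delimiter line: close the current record and drop the delimiter.
            if current:
                records.append("".join(current))
            current = []
        else:
            current.append(line)
    if current:  # trailing text with no closing delimiter line
        records.append("".join(current))
    return records


# Made-up sample input, loosely shaped like CVS log output.
SAMPLE = (
    "RCS file: a.txt,v\n"
    "----------\n"
    "revision 1.1\n"
    "==========\n"
    "RCS file: b.txt,v\n"
    "----------\n"
    "revision 1.2\n"
    "----------\n"
    "revision 1.1\n"
    "==========\n"
)

records = split_records(SAMPLE)
print(len(records))
```

Each element of records is then a multi-line string, which is what the later regular expression is matched against.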
Output Log Records matching a Regex
Next we need to add a Record Editor filter to output only those log
records that match a regular expression. Select Output from the Record Editor dialog box, and
uncheck the checkbox labeled Enable Default Output. In the address
selection box, select the option Records Matching Regex: and click Set.
Specify Regex matching required Log Records
We specify
the regular expression for matching records that contain more than
one log message: ^-+$.*^-+$. Below is a step-by-step
analysis of the regular expression pattern and the required flags.
- The character ^ anchors the search to the beginning of
the line. The pattern ^-+ means match one or more hyphen
characters (-) starting from the beginning of the line.
- Next, the character $ indicates end-of-line. So the
pattern ^-+$ matches those lines that consist entirely of
hyphen characters (-).
- The next addition to the regular expression is .*
which means "match anything". So the pattern ^-+$.* means
any text following a line consisting entirely of
hyphens.
- Adding the pattern ^-+$ matches those records which
contain any text between two lines consisting
entirely of hyphen characters.
- Additionally, we need to select the regular expression flag:
'.' Matches Newline. When we specified the pattern
.* to match any text, ordinarily the line terminator
character(s) would not be matched. Thus, .* would match
any text on a single line. To match text on more than one line using
.*, we need to turn on this option.
- Another option which needs to be turned on is: Multiline Mode. This option specifies that the
anchor characters ^ and $ match not only at the
beginning and end of the record respectively, but also before and
after any line terminator characters within the record. We need this
option since we are attempting to specify text between two lines
consisting entirely of hyphen characters (-).
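The analysis above can be checked with a short Python sketch. It uses Python's re module as a stand-in, on the assumption that re.DOTALL and re.MULTILINE correspond to the '.' Matches Newline and Multiline Mode options; Ferrite's own regex engine may differ in details. The two sample records and the helper name are made up for illustration.

```python
import re

PATTERN = r"^-+$.*^-+$"

# Pattern from the article with both flags turned on:
#   re.DOTALL    stands in for "'.' Matches Newline"
#   re.MULTILINE stands in for "Multiline Mode"
MULTI_MESSAGE = re.compile(PATTERN, re.DOTALL | re.MULTILINE)


def has_multiple_messages(record):
    """Return True if the record contains at least two all-hyphen lines."""
    return MULTI_MESSAGE.search(record) is not None


# Made-up records: one log message versus two log messages.
single = "RCS file: a.txt,v\n----------\nrevision 1.1\n"
multiple = (
    "RCS file: b.txt,v\n"
    "----------\n"
    "revision 1.2\n"
    "----------\n"
    "revision 1.1\n"
)

print(has_multiple_messages(single))    # False
print(has_multiple_messages(multiple))  # True

# Dropping either flag makes the two-message record stop matching.
without_dotall = re.search(PATTERN, multiple, re.MULTILINE)
without_multiline = re.search(PATTERN, multiple, re.DOTALL)
print(without_dotall is None, without_multiline is None)  # True True
```

Without re.DOTALL, .* cannot cross the line terminator to reach the second hyphen line; without re.MULTILINE, ^ only matches at the very start of the record, which is not a hyphen line.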
Record Editor Configuration
Once the previous step is completed by clicking Finish, we have the Record Editor main window
as shown here. Click
Finish once more and execute the
pipeline. The results indicate
that we have selected only those records which have more than one log
message.
Summary
This article illustrated the facilities in Ferrite for matching
patterns spanning more than a single line. This capability is useful
when extracting information from emails, data mining websites,
converting text to HTML or XML, ETL tasks (Extraction, Transformation
and Loading), and more.