Data Extraction Example -- Processing CVS logs
Data extraction from text files is a common task in IT departments, whether for generating reports, loading data into a database (ETL: Extraction, Transformation, Loading), processing log files, or a variety of other purposes. In this article, we examine how to use Ferrite to extract some useful information from a CVS log. Important points highlighted in this article include:
  • Records spanning more than one line can be selected by specifying a suitable record delimiter.
  • Only those records that match a regular expression are selected for output, while the rest are suppressed.
  • The pattern we are searching for spans more than one line within the record.
Introduction
CVS stands for Concurrent Versions System and is a tool widely used for version management of files. The log extracted from a typical CVS system is shown here. Each log record is delimited by a line of equals characters (=) as shown. Within each log record, one or more log messages may be present, each delimited by a line of hyphen characters (-). We would like to extract only those log records that contain more than one log message, since a log record with a single log message indicates that the corresponding version-controlled file has not been changed since the previous release. In this picture, a log record with a single log message has been indicated.
Specify the Record Delimiter
We begin by creating a workflow (called cvsLog) and selecting the CVS log file to process. In the File Selection dialog box, click Next to specify the record delimiter. Select the radio button Specify Regular Expression delimiting records and enter the pattern ^=+\n$. This pattern specifies that a line consisting only of equals characters (=) is the record delimiter. Click Finish.
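For readers who want to see the same idea outside the Ferrite dialogs, here is a minimal Python sketch of splitting text into records on a delimiter line. It is not Ferrite's implementation. The sample log is a made-up, abbreviated stand-in for real CVS output, and the names SAMPLE_LOG, RECORD_DELIMITER and split_records are hypothetical. Because Python applies the pattern to the whole text rather than to one line at a time, the sketch drops the trailing $ from the pattern above.

```python
import re

# Hypothetical, abbreviated stand-in for a CVS log (not taken from the article).
SAMPLE_LOG = (
    "Working file: a.c\n"
    "----------------------------\n"
    "revision 1.1\n"
    "Initial revision\n"
    "=============================\n"
    "Working file: b.c\n"
    "----------------------------\n"
    "revision 1.2\n"
    "Fix typo\n"
    "----------------------------\n"
    "revision 1.1\n"
    "Initial revision\n"
    "=============================\n"
)

# A line made up only of '=' characters, together with its newline,
# marks the end of a record. MULTILINE lets ^ match at every line start.
RECORD_DELIMITER = re.compile(r"^=+\n", re.MULTILINE)


def split_records(text):
    """Split text at delimiter lines and drop empty pieces."""
    return [piece for piece in RECORD_DELIMITER.split(text) if piece.strip()]


records = split_records(SAMPLE_LOG)
print(len(records))  # 2
```

Each element of records is one multi-line log record with the delimiter line removed.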
Output Log Records matching a Regex
Next, we need to add a Record Editor filter to output only those log records that match a regular expression. Select Output from the Record Editor dialog box, and uncheck the checkbox entitled Enable Default Output so that non-matching records are not written. In the address selection box, select the option Records Matching Regex and click Set.
Specify Regex matching required Log Records
We specify the regular expression for matching records that contain more than one log message: ^-+$.*^-+$. Below is a step-by-step analysis of the regular expression pattern and the required flags.
  • The character ^ anchors the search to the beginning of a line. The pattern ^-+ means match one or more hyphen characters (-) starting at the beginning of a line.
  • Next, the character $ indicates end-of-line. So the pattern ^-+$ matches those lines that consist entirely of hyphen characters (-).
  • The next addition to the regular expression is .* which means "match any sequence of characters, including none". So the pattern ^-+$.* matches a line consisting entirely of hyphens followed by any text.
  • Appending ^-+$ once more gives the complete pattern ^-+$.*^-+$, which matches those records that contain two lines consisting entirely of hyphen characters with any text between them.
  • Additionally, we need to select the regular expression flag: '.' Matches Newline. When we specified the pattern .* to match any text, ordinarily the line terminator character(s) would not be matched. Thus, .* would match any text on a single line. To match text on more than one line using .*, we need to turn on this option.
  • Another option which needs to be turned on is: Multiline Mode. This option specifies that the anchor characters ^ and $ match not only at the beginning and end of the record respectively, but also before and after any line terminator characters within the record. We need this option since we are attempting to specify text between two lines consisting entirely of hyphen characters (-).
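The effect of the two flags can be checked with a short Python sketch. Python's re.DOTALL and re.MULTILINE are used here as stand-ins for the '.' Matches Newline and Multiline Mode options. This assumes the two behave alike, which the article does not state. The record text is a made-up example, and the helper name matches is hypothetical.

```python
import re

PATTERN = r"^-+$.*^-+$"

# Hypothetical record with two log messages, so two hyphen-only lines.
record = (
    "Working file: b.c\n"
    "----------------------------\n"
    "revision 1.2\n"
    "----------------------------\n"
    "revision 1.1\n"
)


def matches(flags):
    """Return True if PATTERN is found anywhere in the record."""
    return re.search(PATTERN, record, flags) is not None


print(matches(0))                          # False: ^ only matches at the start of the record
print(matches(re.MULTILINE))               # False: .* cannot cross a line terminator
print(matches(re.DOTALL))                  # False: ^ and $ do not match at inner lines
print(matches(re.MULTILINE | re.DOTALL))   # True: both options are needed
```

Only the combination of both flags lets the pattern span from one hyphen-only line to the next.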
Record Editor Configuration
Once the previous step is completed by clicking Finish, we have the Record Editor main window as shown here. Click Finish once more and execute the pipeline. The results indicate that we have selected only those records which have more than one log message.
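As a rough cross-check of the result, the whole workflow can be approximated in a few lines of Python. This is a sketch under assumptions, not what Ferrite does internally. The sample log is invented and abbreviated, changed_records is a hypothetical helper, and the delimiter pattern omits the trailing $ because Python applies it to the whole text rather than to one line at a time.

```python
import re

RECORD_DELIMITER = re.compile(r"^=+\n", re.MULTILINE)
MULTI_MESSAGE = re.compile(r"^-+$.*^-+$", re.MULTILINE | re.DOTALL)


def changed_records(log_text):
    """Yield only records that contain at least two hyphen-only lines."""
    for record in RECORD_DELIMITER.split(log_text):
        # Records that do not match are skipped, which mirrors
        # turning off the default output.
        if MULTI_MESSAGE.search(record):
            yield record


# Hypothetical, abbreviated log: a.c has one log message, b.c has two.
sample_log = (
    "Working file: a.c\n"
    "----------------------------\n"
    "revision 1.1\n"
    "=============================\n"
    "Working file: b.c\n"
    "----------------------------\n"
    "revision 1.2\n"
    "----------------------------\n"
    "revision 1.1\n"
    "=============================\n"
)

selected = list(changed_records(sample_log))
for record in selected:
    print(record.splitlines()[0])  # prints "Working file: b.c"
```

Only the record for b.c is selected, since it is the only one with more than one log message.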
Summary
This article illustrated the facilities in Ferrite for matching patterns that span more than a single line. This capability is useful when extracting information from emails, mining data from web sites, converting text to HTML or XML, performing ETL tasks (Extraction, Transformation and Loading), and more.







































