Converting Text to HTML
Introduction
Organizations that hold documents in plain text format often need to
convert them to HTML for publication on the web. Many tools offer
one-click conversion to HTML, but they typically leave many hours of
manual work when the documents need to be pre-processed before the
conversion. For these situations, Ferrite, with its extensive text
processing and manipulation facilities, offers a one-stop solution.
This document explains a few of these features.
Article Overview
Ferrite supports a large number of text transformation primitives "out
of the box". This tutorial demonstrates how to use some of these
primitives to parse and transform plain text to HTML. The text to be
converted can be viewed here.
In this tutorial, we examine the following features:
- Using a regular expression as the record delimiter.
- Finding and replacing regular expression patterns.
- Customizing the search and replace facilities.
- Adding headers and footers.
Creating a Workflow
Create a File Processor workflow and select the file to be processed.
For details on creating a File Processor workflow, check out the Getting Started
tutorial.
Once the workflow has been created, it is opened in the workflow
editor as shown.
Change the Record Delimiter
Text manipulation in Ferrite proceeds by processing one record at a
time. By default, each line of the input is treated as a
record. However, it is possible to specify a regular expression to be
used as the record delimiter. This facility is useful in many
instances:
- Processing text by paragraphs instead of line-by-line.
- Processing text by sentences, with a period (written \. in a
regular expression) as the record delimiter.
- Using HTML/XML tag endings as record delimiters.
To change the record delimiter:
- Double-click on the "File Selector"
filter within the workflow editor.
- Click Next in the File Selection wizard.
- Change the record separator option to "Specify regular expression
delimiting records".
- Enter the following regular expression as the record
delimiter. This regular expression specifies that two or more
consecutive line-terminators delimit a record. The line-terminator can
be CRLF (for Windows) or LF (for Unix/Linux).
(\r?\n){2,}
- Ensure that the "Handling the separator" option is set to "The
separator is a part of the preceding record".
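Outside Ferrite, the effect of this delimiter can be sketched in a few
lines of Python. This is only an illustration: the function name
split_records is hypothetical and not part of Ferrite. The pattern is
the one given above, written with a non-capturing group, and each
separator is kept with the preceding record to mirror the option
selected in the last step.

```python
import re

# The tutorial's delimiter: two or more consecutive line terminators (CRLF or LF).
RECORD_DELIMITER = re.compile(r"(?:\r?\n){2,}")

def split_records(text):
    """Split text into records, keeping each separator with the preceding record."""
    records = []
    start = 0
    for match in RECORD_DELIMITER.finditer(text):
        records.append(text[start:match.end()])  # record plus its trailing separator
        start = match.end()
    if start < len(text):
        records.append(text[start:])  # last record has no trailing separator
    return records

sample = "INTRODUCTION\n\nFirst paragraph,\nsecond line.\r\n\r\nLast paragraph."
records = split_records(sample)
for record in records:
    print(repr(record))
```

A single line terminator inside a paragraph does not end the record;
only a run of two or more does, so joining the records back together
reproduces the input exactly.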
Add Line Numbering
To verify that the records are being scanned properly, select Add
-> Line Numbering from the main menu. This filter prefixes each
record with its index, counted from the start of the input.
Note: This filter is called "Line Numbering" because the default
mode of processing is line-by-line. A more appropriate name for this
filter would be "Record Numbering", but "Line Numbering" was chosen
for simplicity.
Executing the workflow shows that records are now scanned by
paragraphs as desired.
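The same check can be sketched in Python. The helper number_records is
hypothetical, and both the 1-based count and the "N: " prefix format
are assumptions; the tutorial does not state how Ferrite formats the
prefix.

```python
def number_records(records):
    """Prefix each record with its index (1-based; the "N: " format is an assumption)."""
    return [f"{index}: {record}" for index, record in enumerate(records, start=1)]

# Records as produced by paragraph-wise scanning (separator kept with each record).
records = ["INTRODUCTION\n\n", "First paragraph,\nsecond line.\n\n", "Last paragraph."]
numbered = number_records(records)
for entry in numbered:
    print(repr(entry))
```

If a multi-line paragraph receives a single number, the records are
being scanned by paragraph rather than by line.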
Scanning for header text
On closely observing the input text, we notice that a line consisting
entirely of uppercase letters is a section header. So let us perform
a regular expression search and replace to convert this text to an
HTML header.
Select "Find/Replace" -> "Pattern within line .." from the main
menu, and enter the following regular expression to search for. Also
select the option "Multiline Mode" since the record may contain one or
more line terminators.
^\s*([A-Z ,.]+)\s*$
The following table explains the regular expression in more detail.
| Regex Component | Meaning |
| ^ | Begin the search at the start of the record. |
| \s* | Match zero or more whitespace characters. Besides space and tab characters, this covers the line terminators that a multi-line record may contain. |
| [A-Z ,.]+ | Match any combination of one or more uppercase letters, commas (,), spaces, and periods (.). |
| ([A-Z ,.]+) | The above pattern is enclosed in parentheses because we want to use the matched substring in the replacement. The substring matched by the first pair of parentheses is available in the replacement string as $1. |
| \s*$ | Match zero or more whitespace characters at the end of the record. Since this component of the pattern is outside the parentheses, the trailing whitespace it matches is not included in the replacement. |
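To see what the pattern captures, here is a minimal Python sketch. It
assumes that Python's re.MULTILINE flag behaves like Ferrite's
"Multiline Mode" option, which the tutorial does not confirm, and the
helper name heading_text is hypothetical.

```python
import re

# The tutorial's pattern. re.MULTILINE stands in for Ferrite's "Multiline Mode"
# option; that the two behave identically is an assumption.
HEADING_RX = re.compile(r"^\s*([A-Z ,.]+)\s*$", re.MULTILINE)

def heading_text(record):
    """Return the text captured by the first group, or None if nothing matches."""
    match = HEADING_RX.search(record)
    return match.group(1) if match else None

print(heading_text("  GETTING STARTED\n\n"))
print(heading_text("First paragraph,\nsecond line.\n\n"))
print(heading_text("Chapter notes\nNASA\nmore text\n\n"))
```

Two details of the Python behavior are worth knowing. Because the
character class itself contains a space, trailing spaces on a heading
line end up inside the group; only the line terminators are left out.
Also, with multiline matching an all-uppercase line inside a longer
paragraph matches on its own, as the third call shows.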
Click Next to specify the replacement.
Convert to an HTML header
We would like to convert matched records to HTML headers. So we
specify the following replacement. Note that $1 inserts the part of
the matched text captured by the first pair of parentheses.
<h2 class="heading">$1</h2>
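The search and the replacement together correspond to a single regular
expression substitution. A Python sketch follows; note that Python
writes the group reference as \1 where the replacement above uses $1.
The function name to_heading is hypothetical, and re.MULTILINE is
assumed to match Ferrite's "Multiline Mode".

```python
import re

HEADING_RX = re.compile(r"^\s*([A-Z ,.]+)\s*$", re.MULTILINE)

# Python spells the group reference \1 where the tutorial's replacement uses $1.
REPLACEMENT = r'<h2 class="heading">\1</h2>'

def to_heading(record):
    """Replace a heading record with an <h2> element; other records pass through."""
    return HEADING_RX.sub(REPLACEMENT, record)

print(to_heading("ARTICLE OVERVIEW\n\n"))
print(repr(to_heading("Plain paragraph.\n\n")))
```

The line terminators after the heading are matched by the pattern but
lie outside the group, so they do not appear in the result.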
Converting paragraphs
While we have converted records consisting entirely of uppercase
characters to headers, we would also like to convert the other records
to HTML paragraphs. To do this, we need to modify the filter added
in the above step for converting the header. Records that were NOT
converted to a header need to be converted to paragraphs.
Double-click on the filter named "Replace within line ..". The Field
Processor wizard (Field Processor is the underlying filter for most of
the text conversion and manipulation primitives) is shown. It contains
a rule for converting records matching a pattern to headers as
shown. This rule is shown with the pattern:
rx.test(input[0]).
Modifying the DEFAULT rule
To convert records that were not converted to header records, we need
to modify the rule marked "DEFAULT". To do so, double-click on the
rule. The following pattern selection wizard appears. Note that the
option "Apply rule to unprocessed records" is selected. Click Next to
change the action associated with the rule.
Action for DEFAULT rule
The action currently associated with the DEFAULT rule outputs the
current record without any modification.
We need to change this action to the following, which outputs the
current record enclosed within HTML paragraph tags.
println("<p>" + input[0] + "</p>");
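The interplay of the two rules can be sketched in Python: a record is
tried against the header rule first, and only records left unprocessed
reach the DEFAULT rule. The function name convert_record is
hypothetical, and this is an approximation of the rule logic rather
than Ferrite's actual implementation.

```python
import re

HEADING_RX = re.compile(r"^\s*([A-Z ,.]+)\s*$", re.MULTILINE)

def convert_record(record):
    """Apply the header rule if it matches, otherwise fall back to the default rule."""
    if HEADING_RX.search(record):  # counterpart of rx.test(input[0])
        return HEADING_RX.sub(r'<h2 class="heading">\1</h2>', record)
    # DEFAULT rule: counterpart of println("<p>" + input[0] + "</p>")
    return "<p>" + record + "</p>"

records = [
    "INTRODUCTION\n\n",
    "Conversion of text documents\nis commonly required.\n\n",
]
html_fragments = [convert_record(record) for record in records]
for fragment in html_fragments:
    print(fragment)
```

Because the separator belongs to the preceding record, the blank line
that ended the paragraph stays inside the paragraph tags. Browsers
treat it as ordinary whitespace.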
Add Header and Footer
Let us add a header consisting of HTML style information to the
output. Select Add -> Header and enter the text shown in the dialog
box.
To add a footer, select Add -> Footer and enter the text shown.
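In script form, a header and footer simply wrap the converted records.
The HEADER and FOOTER strings in this Python sketch are placeholders
chosen for illustration; they are not the text entered in the
tutorial's dialog boxes. The function name wrap_document is likewise
hypothetical.

```python
# Placeholder text only: the tutorial's dialog contents are not reproduced here.
HEADER = (
    "<html>\n<head>\n"
    "<style>h2.heading { font-family: sans-serif; }</style>\n"
    "</head>\n<body>\n"
)
FOOTER = "</body>\n</html>\n"

def wrap_document(fragments, header=HEADER, footer=FOOTER):
    """Emit the header once, then each fragment on its own line, then the footer."""
    body = "".join(fragment + "\n" for fragment in fragments)
    return header + body + footer

page = wrap_document(['<h2 class="heading">INTRODUCTION</h2>', "<p>Some text.</p>"])
print(page)
```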
Save the output
Executing the pipeline applies the filters that we added above. To
save the output, select Output -> Save As from the main menu. Select
the output file by clicking Browse and navigating to the output
directory.
We also need to specify a suffix for the backup file. Each time the
pipeline is executed, the output file, if it exists, is backed up
under a name built from this suffix and a numbering scheme.
To save the output in Unicode or any of the other supported encodings,
select the encoding from the "Use Encoding" table. Click Finish to add
the "Save As" filter.
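The backup behavior can be approximated in Python as follows. The
tutorial does not spell out the numbering scheme, so the naming used
here (file name, then suffix, then the first unused number) is an
assumption, as are the function name save_with_backup and the ".bak"
default suffix.

```python
import os
import shutil
import tempfile

def save_with_backup(path, text, suffix=".bak", encoding="utf-8"):
    """Write text to path; an existing file is first copied to path + suffix + N.

    The naming scheme is an assumption, not Ferrite's documented behavior.
    Returns the backup path, or None if there was nothing to back up.
    """
    backup = None
    if os.path.exists(path):
        number = 1
        while os.path.exists(f"{path}{suffix}{number}"):
            number += 1  # find the first unused backup number
        backup = f"{path}{suffix}{number}"
        shutil.copy2(path, backup)
    with open(path, "w", encoding=encoding, newline="") as out:
        out.write(text)
    return backup

# Demonstration in a throwaway directory: the second save backs up the first.
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "output.html")
    first_backup = save_with_backup(target, "<p>draft</p>\n")
    second_backup = save_with_backup(target, "<p>final</p>\n")
    with open(second_backup, encoding="utf-8") as handle:
        backup_text = handle.read()
    with open(target, encoding="utf-8") as handle:
        final_text = handle.read()
```

The encoding parameter plays the role of the "Use Encoding" selection:
passing, for example, "utf-16" writes the same text in a different
encoding.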
Execute and Preview Output
Click the execute button to run the pipeline. The output is saved to
the specified file. Open the file in a browser and verify that the
transformations have been applied as expected.