Converting Text to HTML

Introduction

Organizations that hold many documents in plain text format often need to convert them to HTML for publication on the web. Many tools offer one-click conversion to HTML, but they typically demand hours of manual work when the documents must be pre-processed before conversion. For these situations, Ferrite, with its extensive text processing and manipulation facilities, offers a one-stop solution. This document explains a few of these features.

Article Overview

Ferrite supports a large number of text transformation primitives "out of the box". This tutorial demonstrates how to use some of these primitives to parse and transform plain text to HTML. The text to be converted can be viewed here.

In this tutorial, we examine the following features:

  • Using a regular expression as the record delimiter.
  • Finding and replacing regular expression patterns.
  • Customizing the search and replace facilities.
  • Adding headers and footers.

Creating a Workflow

Create a File Processor workflow and select the file to be processed. For details on creating a File Processor workflow, check out the Getting Started tutorial.

Once the workflow has been created, it is opened in the workflow editor as shown.

Change the Record Delimiter

Text manipulation in Ferrite proceeds by processing a record at a time. By default, each line of the input is treated as a record. However, it is possible to specify a regular expression to be used as the record delimiter. This facility is useful in many instances:
  • Processing text by paragraphs instead of line-by-line.
  • Processing text by sentences, with an escaped period (\.) as the record delimiter. An unescaped period in a regular expression matches any character.
  • Using HTML/XML tag endings as record delimiters.

To change the record delimiter:

  • Double-click on the "File Selector" filter within the workflow editor.
  • Click Next in the File Selection wizard.
  • Change the record separator option to "Specify regular expression delimiting records".
  • Enter the following regular expression as the record delimiter. This regular expression specifies that two or more consecutive line-terminators delimit a record. The line-terminator can be CRLF (for Windows) or LF (for Unix/Linux).
    (\r?\n){2,}
    
  • Ensure that the "Handling the separator" option is set to "The separator is a part of the preceding record".
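
To see what this delimiter does outside Ferrite, here is a minimal Python sketch. It is not Ferrite's implementation; the function name split_records and the sample text are made up for illustration. It applies the same regular expression and keeps each separator with the record that precedes it, as the option above requests.

```python
import re

# Two or more consecutive line terminators (CRLF or LF) end a record.
RECORD_DELIMITER = re.compile(r"(\r?\n){2,}")

def split_records(text):
    """Split text into records, keeping each delimiter with the record before it."""
    records = []
    start = 0
    for match in RECORD_DELIMITER.finditer(text):
        # The separator is treated as a part of the preceding record.
        records.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        # Text after the last delimiter forms the final record.
        records.append(text[start:])
    return records

# Hypothetical sample: one heading and two paragraphs, mixing LF and CRLF.
sample = "FIRST HEADING\n\nA paragraph that\nspans two lines.\r\n\r\nLast paragraph."
for record in split_records(sample):
    print(repr(record))
```

A single line terminator inside a paragraph does not end the record, because the pattern requires at least two in a row.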

Add Line Numbering

To verify that the records are being scanned properly, select Add -> Line Numbering from the main menu. This filter prefixes each record with its index, counted from the beginning of the input.

Note: This filter is called "Line Numbering" because the default mode of processing is line-by-line. A more appropriate name for this filter would be "Record Numbering", but "Line Numbering" was chosen for simplicity.

Executing the workflow shows that records are now scanned by paragraphs as desired.
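
The same check can be sketched in Python, outside Ferrite. The prefix format ("1: ", "2: ", ...) and the 1-based counting are assumptions made for this example; Ferrite's own numbering format may differ.

```python
import re

def number_records(text):
    """Split text on blank-line delimiters and prefix each record with its index."""
    # Here the delimiters are dropped rather than kept, to keep the output compact.
    records = [r for r in re.split(r"(?:\r?\n){2,}", text) if r]
    return [f"{index}: {record}" for index, record in enumerate(records, start=1)]

# Hypothetical input: three paragraphs, the second spanning two lines.
for line in number_records("A\n\nB\nC\n\n\nD"):
    print(repr(line))
```

If the numbers advance once per paragraph rather than once per line, the record delimiter is working as intended.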

Scanning for header text

A close look at the input text shows that a line consisting entirely of uppercase letters is a section header. Let us perform a regular expression search and replace to convert such lines to HTML headers.

Select "Find/Replace" -> "Pattern within line .." from the main menu, and enter the following regular expression to search for. Also select the option "Multiline Mode" since the record may contain one or more line terminators.

^\s*([A-Z ,.]+)\s*$

The following table explains the regular expression in more detail.

Regex Component    Meaning
^                  Begin the match at the start of the record. With multiline mode turned on, ^ also matches at the start of each line within the record.
\s*                Match zero or more whitespace characters. Whitespace characters include spaces, tabs, and line terminators.
[A-Z ,.]+          Match any combination of one or more uppercase letters, commas (,), spaces, and periods (.).
([A-Z ,.]+)        The above pattern is enclosed in parentheses because we want to use the matched substring in the replacement. The substring matched by the first pair of parentheses is available in the replacement string as $1.
\s*$               Match zero or more whitespace characters at the end of the record (or, in multiline mode, at the end of a line). Because this component lies outside the parentheses, trailing line terminators are not included in the replacement.
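
The effect of multiline mode can be checked with a short Python sketch. This uses Python's re module, not Ferrite's regex engine, so small differences are possible; find_header is a hypothetical helper name.

```python
import re

PATTERN = r"^\s*([A-Z ,.]+)\s*$"

def find_header(record, multiline=True):
    """Return the text captured by the first pair of parentheses, or None."""
    flags = re.MULTILINE if multiline else 0
    match = re.search(PATTERN, record, flags)
    return match.group(1) if match else None

# A record that is entirely an uppercase heading matches in either mode,
# because \s matches the trailing line terminators regardless of the flag.
print(find_header("INTRODUCTION\n\n"))
print(find_header("INTRODUCTION\n\n", multiline=False))

# Multiline mode lets ^ and $ anchor at line boundaries inside the record.
print(find_header("intro line\nSECOND LINE\n"))
print(find_header("intro line\nSECOND LINE\n", multiline=False))
```

The last two calls differ only in the flag: with multiline mode the uppercase second line is found, and without it the pattern must match from the very start of the record and fails.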

Click Next to specify the replacement.

Convert to an HTML header

We would like to convert matched records to HTML headers, so we specify the following replacement. Note that $1 inserts the part of the matched text that was captured by the parentheses.
<h2 class="heading">$1</h2>
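
As a rough equivalent outside Ferrite, the same search and replace can be written in Python. Python spells the $1 backreference as \g<1>; the function name to_heading is an assumption for this sketch.

```python
import re

HEADER = re.compile(r"^\s*([A-Z ,.]+)\s*$", re.MULTILINE)

def to_heading(record):
    """Replace an all-uppercase line with an h2 element built from group 1."""
    # \g<1> plays the role of $1 in the replacement string.
    return HEADER.sub(r'<h2 class="heading">\g<1></h2>', record)

print(to_heading("INTRODUCTION\n\n"))
print(repr(to_heading("Plain text.\n\n")))
```

Records that do not match the pattern pass through unchanged, which is why a separate rule is needed for ordinary paragraphs.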

Converting paragraphs

We have converted records consisting entirely of uppercase characters to headers, and we would also like to convert the other records to HTML paragraphs. To do this, we need to modify the filter added in the step above: records that were NOT converted to headers need to be converted to paragraphs.

Double-click on the filter named "Replace within line ..". The Field Processor wizard (Field Processor is the underlying filter for most of the text conversion and manipulation primitives) is shown. It contains a rule for converting records matching a pattern to headers as shown. This rule is shown with the pattern: rx.test(input[0]).

Modifying the DEFAULT rule

To convert records that were not converted to headers, we need to modify the rule marked "DEFAULT". To do so, double-click on the rule. The following pattern selection wizard appears. Note that the option "Apply rule to unprocessed records" is selected. Click Next to change the action associated with the rule.

Action for DEFAULT rule

The action currently associated with the DEFAULT rule outputs the current record without any modification.

We need to change this action to the following. This outputs the current record enclosed within the HTML paragraph tags.

println("<p>" + input[0] + "</p>");
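
The combined behaviour of the header rule and the DEFAULT rule can be sketched in Python as follows. This is only an illustration of the rule order, not the Field Processor itself; convert_record is a hypothetical name.

```python
import re

HEADER = re.compile(r"^\s*([A-Z ,.]+)\s*$", re.MULTILINE)

def convert_record(record):
    """Apply the header rule first; fall back to the DEFAULT rule otherwise."""
    if HEADER.search(record):
        # Header rule: corresponds to the pattern rx.test(input[0]).
        return HEADER.sub(r'<h2 class="heading">\g<1></h2>', record)
    # DEFAULT rule: mirrors println("<p>" + input[0] + "</p>").
    return "<p>" + record + "</p>"

print(convert_record("OVERVIEW\n\n"))
print(repr(convert_record("Plain text.\n\n")))
```

Because the separator is a part of the record, the blank line ends up inside the paragraph tags. Browsers collapse that whitespace, so the rendered page is unaffected.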

Add Header and Footer

Let us add a header consisting of HTML style information to the output. Select Add -> Header and enter the text shown in the dialog box.

To add a footer, select Add -> Footer and enter the text shown.
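
In code terms, the header and footer simply wrap the converted records. The Python sketch below shows the idea; the header and footer strings are placeholders invented for this example and are not the text from the tutorial's dialog boxes.

```python
def wrap_document(body_records, header, footer):
    """Place the header before the converted records and the footer after them."""
    return header + "".join(body_records) + footer

# Placeholder header and footer (assumptions, not the tutorial's actual text).
EXAMPLE_HEADER = "<html>\n<head>\n<style>h2.heading { color: navy; }</style>\n</head>\n<body>\n"
EXAMPLE_FOOTER = "</body>\n</html>\n"

page = wrap_document(
    ['<h2 class="heading">OVERVIEW</h2>\n', "<p>Plain text.</p>\n"],
    EXAMPLE_HEADER,
    EXAMPLE_FOOTER,
)
print(page)
```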

Save the output

Executing the pipeline results in applying the filters that we added above. To save the output, select Output -> Save As from the main menu. Select the output file by clicking browse and navigating to the output directory.

We also need to specify a suffix for the backup file. The output file, if it exists, is backed up each time the pipeline is executed using the suffix along with a numbering scheme.

To save the output in Unicode or any of the other supported encodings, select the encoding from the "Use Encoding" table. Click Finish to add the "Save As" filter.
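
The backup behaviour described above might look like the following Python sketch. The exact naming scheme Ferrite uses is not documented here, so the scheme in the code (suffix followed by a running number, such as out.html.bak1) is an assumption, as are the function name and its parameters.

```python
import os
import shutil
import tempfile

def save_with_backup(path, text, backup_suffix=".bak", encoding="utf-8"):
    """Write text to path, first copying any existing file to a numbered backup."""
    backup_path = None
    if os.path.exists(path):
        number = 1
        # Assumed scheme: out.html.bak1, out.html.bak2, and so on.
        while os.path.exists(f"{path}{backup_suffix}{number}"):
            number += 1
        backup_path = f"{path}{backup_suffix}{number}"
        shutil.copy2(path, backup_path)
    # newline="" writes line terminators exactly as they appear in text.
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(text)
    return backup_path

# Demonstration in a temporary directory so that no real files are touched.
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "out.html")
    print(save_with_backup(target, "<p>first run</p>"))  # nothing to back up yet
    backup = save_with_backup(target, "<p>second run</p>", encoding="utf-16")
    print(os.path.basename(backup))
```

Each run leaves the previous output intact under a new backup name, so an unwanted change to the workflow can be undone by restoring the most recent backup.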

Execute and Preview Output

Click the execute button to run the pipeline. The output is saved to the specified file. Open the file in a browser and verify that the transformations have been applied to your satisfaction.
