Content Extractor

1.0a

12/28/99
Initial development. It drives the sample site on Linux just fine.

1.0b

12/29/99
Added error messages and development hints. Added delimiter variable for Mac compatibility. Changed PATH_INFO delimiter to $ for Mac compatibility and directory structure consistency.

1.0

1/17/00
Discontinued use of PATH_INFO -- the template is now the first argument. Fixed file reading so it will take any kind of line break. Cleaned up code, added comments, added error messages and other misc. improvements. Driving main Type A site on Linux, and also works on WebStar (Mac) server.

1.1

1/19/00
Added a registration checking routine. You can specify a delivery date, grace period, and registration code in the script when installing it on a client's web server, and it will stop functioning after the grace period expires, unless a file called registration.ini, containing the correct registration code, is added to the cgi-bin.

2.0b

1/25/00
Added editing mode to edit content live in the browser.

2.0

2/16/00
Cleaned up glitches in live editing mode. Added auto-save when leaving pages. Renamed script "content.cgi" -- content because it now does input and output; cgi to prevent hangs in MacPerl.

2.0.1

2/18/00
Fixed a line-spacing incompatibility between Netscape and IE during live content saves.

2.0.2

2/20/00
Allowed regex special characters during live content saves.

2.1

3/20/00
Changed the way arguments are passed to the script. You should now use two name/value pairs -- one for the template file, and one for the other arguments. So, a dynamic URL might look like this:

../cgi-bin/content.cgi?template=main.html&args=company.html,about

Also, the .html extension in your dynamic URL is now optional. If you don't specify an extension, Content Extractor will default to .html.
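
As a rough illustration of the 2.1 URL format (a Python sketch, not the script's own Perl code), the following splits a query string into its template and args values and applies the default .html extension to the template. The function names are hypothetical, and how the script applies the default to content-file arguments is not shown here.

```python
from urllib.parse import parse_qs

DEFAULT_EXTENSION = ".html"  # default described in the 2.1 notes


def add_default_extension(name):
    """Append .html when the last path segment has no extension."""
    last_segment = name.rsplit("/", 1)[-1]
    return name if "." in last_segment else name + DEFAULT_EXTENSION


def parse_dynamic_url(query):
    """Split 'template=...&args=a,b' into a template file and an argument list."""
    pairs = parse_qs(query, keep_blank_values=True)
    template = add_default_extension(pairs["template"][0])
    args = pairs.get("args", [""])[0].split(",")
    return template, args
```

With this sketch, template=main and template=main.html both resolve to main.html.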

2.1.1

5/1/00
Oops, bug in the script which prevented the .html optional feature from working on some web servers. Fixed now.

2.1.2

8/17/00
Changed parsing routine so that if an arg is missing, that embed simply won't get processed; so you can parse an indeterminate number of embeds by including or leaving off your trailing args.

Added support for nested content files on multiple platforms (e.g. template=faculty&args=faculty/smith,bio,quotes).

Set the HTML header so that when running from a Mac server, the page is never cached. This will aid development within our office network.

2.2

10/26/00
Fixed bug that was interfering with live editing routines, and updated various pages in the edit directory.

Added content.php, a PHP wrapper that lets you embed PHP code in your templates or content files, or include PHP files as an argument to an <!--#embed file --> tag.

2.2.1

11/16/00
Restored backwards compatibility with old-style query strings, which was inadvertently lost with the bug fix in version 2.2.

2.5b

12/8/00
The extractor no longer chops the last character out of the files it processes, in case the last character is not a line break.

Added new embed type, random, which takes one argument, contentFile. The fields in contentFile must be named with consecutive numerals, but it can be a regular HTML pseudo-database file in every other respect.

Added new embed type, passthru, which takes one argument, value. Passthru simply writes the value of the argument into the HTML code. It can be used to display text, or to set the value of an HTML attribute, such as an image source.

Embed tags can now include hard-coded values for any or all of their arguments. For example, the tag <!--#embed field "my_content", arg1 --> will extract the field specified in the first URL argument, from the content file my_content.html.

Templates and content files can now be located in non-standard locations. For example, template=my_template will open the file ../templates/my_template.html, but template=../other/my_special_template will open the file ../other/my_special_template.

An alias file can now be used to shorten awkward URLs and help manage duplicate links throughout a site. If you create a pseudo-database file at ../content/alias.html, you can store a dynamic URL definition (template=xxx&args=y,z) in each field, with the field name serving as the alias to that URL. Then, you can open dynamic URLs by referencing just the alias. For example: http://www.mydomain.com/scripts/content.cgi?alias=mycomplicatedURL.

2.5b2

1/8/01
You can now enclose hard-coded embed tag values in either single or double quotes.

Also, the extractor recognizes other name/value pairs besides template= and args= in alias definitions.

2.5b3

1/28/01
You can now use a combination of hard-coded and variable values in embed tags, using the "+" symbol for concatenations. For example, with the URL ../scripts/content.cgi?template=default&args=main,joe, the tag <!--#embed field arg1,arg2+"_picture" --> will extract the field called "joe_picture" from the file main.html.
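
One way to picture the "+" concatenation is the Python sketch below. It is an assumption-laden illustration, not the extractor's code: resolve_value is a hypothetical name, and splitting on every "+" means a literal "+" inside a quoted value is not handled.

```python
import re


def resolve_value(expression, args):
    """Resolve one embed value such as arg2+"_picture" against the URL args."""
    parts = []
    for token in expression.split("+"):
        token = token.strip()
        quoted = re.fullmatch(r"""(["'])(.*)\1""", token)
        positional = re.fullmatch(r"arg(\d+)", token)
        if quoted:
            parts.append(quoted.group(2))         # hard-coded text, quotes removed
        elif positional:
            index = int(positional.group(1)) - 1  # arg1 is the first URL argument
            parts.append(args[index] if index < len(args) else "")
        else:
            raise ValueError(f"unrecognized token: {token!r}")
    return "".join(parts)
```

For args=main,joe, the value arg2+"_picture" resolves to joe_picture, matching the example above.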

2.5b4

2/18/01
Embed tags that refer to blank arguments won't be parsed. This allows you to leave a field blank by leaving out an argument it uses, without affecting the positions of the other arguments. For example, in the URL ../scripts/content.cgi?template=default&args=content,field1,,field3, the first and third embed tags will be parsed, but the second one will simply be left blank.
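
The blank-argument rule can be sketched in Python as follows. This is only an illustration: should_parse is a hypothetical name, and the three tags are assumed to be of the form <!--#embed field arg1,argN -->.

```python
def should_parse(tag_positions, args):
    """Return True only if every 1-based argument position the tag uses is non-blank."""
    return all(
        1 <= pos <= len(args) and args[pos - 1] != ""
        for pos in tag_positions
    )


# args=content,field1,,field3 from the example URL
args = "content,field1,,field3".split(",")
tags = [(1, 2), (1, 3), (1, 4)]  # hypothetical tags using arg1 plus one field argument
results = [should_parse(tag, args) for tag in tags]
```

The same check covers the 2.1.2 behavior, where a tag is skipped because its trailing argument was left off the URL entirely.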

Also, content.php will now work from any directory level.

2.5.1

3/5/01
Content.cgi now detects a "flash=1" flag in the URL, and if present, redirects to a standard Flash template movie in ../templates/templates.html, passing the template and args variables to the Flash templates.

Also, content.cgi gives a correct error message if it can't find the specified content file. Previously, the script failed with an Internal Server Error; now it prints "Error: Can't open the template file ../templates/nonexistent.html. No such file or directory."

2.5.2

3/29/01
The passthru embed type, used in conjunction with combination arguments (hard-coded plus variable), now works.

2.5.3

4/4/01
Content.cgi now looks for a replace property in embed tags, which triggers some HTML cleanup required for display within a Flash template. For example, <!--#embed field arg1,arg2 replace spaces,paragraphs -->.

The allowed replace values are:

spaces - removes all line breaks and double spaces
paragraphs - adds a <br> tag after every <p> tag
ampersands - replaces all ampersands with the ` character, which can be restored within Flash
links - changes the color of all links to the link color specified in the template's body tag,
and sets the target of all links to "_new"; link colors can also be overridden
in each embed tag: <!--#embed field arg1,arg2 replace link-000099 -->

Also, you can now include more than one embed tag on a single line of your template, and they will be correctly parsed.

2.6

4/17/01
Content.cgi now supports several new embed types, which can be used to quickly create linear navigation elements for presentations or e-learning products. See the extractor demo for examples. The new types, with the arguments they require, are:

nav_current - returns the position of the current field in the file (contentFile, contentField)
nav_count - returns the number of fields in the file (contentFile)
nav_next - returns the name of the next field in the file (contentFile, contentField)
nav_previous - returns the name of the previous field in the file (contentFile, contentField)
nav_first - returns the name of the first field in the file (contentFile)
nav_last - returns the name of the last field in the file (contentFile)

The next and previous tags loop automatically: next returns the first field when going beyond the last field, and previous returns the last field when going before the first field.

You can also skip fields with the next and previous tags, which is handy if your content file is organized like page1, page1_title, page2, page2_title. In this case, you can use <!--#embed nav_next2 arg1,arg2 --> on page1 to get to page2, because "nav_next2" tells the extractor to move ahead two fields.
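
The looping and skipping behavior amounts to modular arithmetic over the list of field names. A minimal Python sketch, assuming the field names have already been read from the content file in order (nav_step is a hypothetical name):

```python
def nav_step(fields, current, step):
    """Return the field `step` positions away from `current`, wrapping at either end."""
    position = fields.index(current)
    return fields[(position + step) % len(fields)]
```

Here nav_next2 would correspond to step=2 and nav_previous to step=-1.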

2.6.1

4/18/01
Ampersand replacement (for use with Flash templates) now uses the string "*amp*" rather than the character "`" as a delimiter.

3.0

5/14/01
The content extractor toolset now includes a major new feature: built-in logging of site statistics. Third-party stats packages don't work well with the content extractor, because they list most hits as content.cgi, giving no useful information about pages visited. But the new stats functionality logs the argument strings, along with the user's IP address, host, referrer, and user agent. To install the stats functionality, copy the stats folder to your site -- it contains stats.cgi, an empty data folder, plus template.html, which you can edit to match the rest of your site. You can also place an index.html page in stats, to redirect to stats.cgi, so you can view the stats at http://www.yourhost.com/stats. Better yet, use .htaccess in the stats folder to set stats.cgi as the index page, and to password protect the directory.

Content.cgi can now read templates over an http connection, rather than directly through the filesystem. This may slow down site performance a bit, but allows you to use CGI scripts or other dynamic pages as templates, because any server-side scripting in those files will execute before the files are parsed by content.cgi. If you've ever wished you could use PHP logic to construct an embed tag in a template, you'll appreciate this feature. Note: reading over http requires that the LWP::Simple Perl module be installed on the web server; if not present, content.cgi will read files through the filesystem as in previous versions, and display a warning message.

Both the stats and the http reading can be turned on or off sitewide with another new feature, the content.ini configuration file. Content.ini is optional, but also provides settings for the default file extension (which still defaults to .html if not specified in this file) and the registration code (which makes the registration.ini file obsolete).

3.0.1

5/18/01
Content.cgi can now read any file, whether a template or a con

As before, references to files in nonstandard locations must begin with ../, otherwise content.cgi will look for them in ../content/ or ../templates/.

Another security enhancement is that content.cgi will not read any file whose name begins with a ".". This makes it impossible to display the contents of .htaccess or .htpasswd files, for example.
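
The two path rules above could be sketched like this in Python. This is an illustration under assumptions: the function names are hypothetical, "name" is taken to mean the final path segment, and ../content/ stands in for whichever default directory applies.

```python
import posixpath


def is_readable(path):
    """Reject any file whose name (the final path segment) begins with a dot."""
    return not posixpath.basename(path).startswith(".")


def resolve_content_path(reference, default_dir="../content/"):
    """References starting with ../ are used as given; others go in the default directory."""
    return reference if reference.startswith("../") else default_dir + reference
```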

And developers can now choose to hide content extractor error messages, by adding the following line to content.ini:

show errors: 0

In that case, extractor errors will be sent to the browser as commented HTML, rather than visible HTML, but will be logged in the stats file as before.

A final minor change is that a leading "." is now optional in the default file extension setting of content.ini.

3.2.1

12/3/01
Fixed a bug in 3.2's nonstandard paths functionality which prevented path settings in content.ini from taking hold if ../templates/ and ../content/ weren't specified. Those two paths are now assumed and don't need to be specified in the ini file.

3.2.2

1/20/02
The stats routine of content.cgi now writes extractor errors and HTTP errors to the errors file, rather than including HTTP errors in the primary stats file. This change accommodates new functionality in stats.cgi version 1.2.2, and requires that version of the script or later for best results.

3.2.3

2/4/02
Previously, a request for a field called ##home would return a field called ##home_placeholder if ##home_placeholder appeared in the content file before ##home. The matching behavior now avoids this situation and only returns an exact match.
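
The difference between prefix matching and exact matching can be shown with a small Python sketch. The file layout is an assumption made for illustration: each field is taken to start on a line consisting of the marker plus the field name, with its content running to the next marker line. extract_field is a hypothetical name.

```python
def extract_field(text, name, marker="##"):
    """Return the content following the line that is exactly marker + name."""
    collected = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(marker):
            if collected is not None:
                break                      # the next field begins; stop collecting
            if stripped == marker + name:  # exact match, not a prefix match
                collected = []
        elif collected is not None:
            collected.append(line)
    return None if collected is None else "\n".join(collected).strip()
```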

4.0

3/5/02
The most substantial update to the content extractor so far, version 4.0 uses a different format for dynamic URLs and embed tags and is not backwards-compatible with any previous version. It provides usability enhancements for content editors, a clearer architecture for developers, and an integrated site flattener for delivering web sites on local drives or servers without Perl.

Fundamental changes

Instead of two key-value pairs, template and args, dynamic URLs now consist of a template variable plus a variable for each embed tag in the template. Embed tag names correspond to variable names in the URL. For example, a template containing three tags...

<!--#embed env_http_user_agent -->
<!--#embed env_remote_addr -->
<!--#embed env_request_uri -->

In addition to the CGI-standard variables, one additional variable is available. This tag displays the modification date of the newest content file or template used to build the dynamic page:

<!--#embed env_last_modified -->

Environment variables, by the way, are also available in conditional tags using Perl syntax:

<!--#if1 ($ENV{'HTTP_USER_AGENT'} =~ "Mac") -->Hello, Macintosh user!<!--#end if1 -->

A few new options have been added to content.ini. "prevent page caching" modifies the header of each dynamic page to prevent the browser from caching it, which may be helpful during troubleshooting. "show expanded URLs" will instantly redirect any short-form dynamic URL to its full format, another troubleshooting aid. And "external link targets" allows you to set the target attributes of anchor tags for absolute links sitewide. If an anchor tag already contains a target attribute, this setting won't override it.

In addition to these new options, you can now set options temporarily on a page-by-page basis by including the setting in the dynamic URL. For example, this URL...

../scripts/content.cgi?page=home&show_expanded_URLs=1

...will redirect to a page that displays the full version of the home URL, even if "show expanded URLs" is turned off in content.ini. The only option that can't be overridden in this way is "read files from," because overriding it would undermine the server security that option provides.
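
A Python sketch of how per-page overrides might be layered over content.ini follows. Several details are assumptions for illustration only: the "name: value" line format is inferred from the "show errors: 0" example, option names are compared case-insensitively with underscores standing in for spaces, and only options already present in the ini file are overridden. The function names are hypothetical.

```python
from urllib.parse import parse_qsl

LOCKED = {"read files from"}  # the one option the notes say a URL can't override


def parse_ini(text):
    """Parse 'name: value' lines into a dictionary keyed by lower-cased option name."""
    settings = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            settings[key.strip().lower()] = value.strip()
    return settings


def effective_settings(ini_text, query):
    """Apply URL overrides to options that exist in the ini settings and aren't locked."""
    settings = parse_ini(ini_text)
    for key, value in parse_qsl(query, keep_blank_values=True):
        option = key.replace("_", " ").lower()
        if option in settings and option not in LOCKED:
            settings[option] = value
    return settings
```

Ordinary URL variables such as page=home are ignored by this sketch because they do not name an ini option.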

The PHP wrapper

Content.php has also been updated to better accommodate a variety of hosting situations. To use content.php with the shortened URLs described above, simply change...

../scripts/content.cgi?page=home to ../scripts/content.php?page=home

To use it in conjunction with .htaccess files, change...

../dynamic/home.html to ../dynamic/home.php

The site flattener will not render PHP that the PHP wrapper would normally render. If it encounters a URL that uses the PHP wrapper, it will save the file to filename.php so that the PHP will still be active in the flattened site.

Miscellaneous fixes

Content.cgi will now filter attempts to read files outside of the standard directories when the filenames begin with ./ as well as ../. This is an enhancement to the security functionality introduced in version 3.2.

If a variable referred to in a conditional tag expression is not available in the current URL, you can still achieve a true condition by testing for (!var[0]) or (var[0] != 1). Previously, the expression evaluated false no matter what you tested for.

4.0.1

3/18/02
Fixed a bug where content.cgi?page=test&test=foo would generate an invalid page error even if a page named test was defined in pages.txt. This was occurring because content.cgi was treating the second argument as an override for the value of test in pages.txt, when really, URL overrides shouldn't affect parsing of the pages.txt file. Now, the overrides will only affect parsing of the content.ini file.

4.0.2

5/16/02
Fixed a small bug that prevented the external link target setting from working when the external link value was quoted.

4.1

5/26/02
When creating templates, you no longer need to number conditional tags to show the content extractor which tags go together; you can just include if/end if tags, nested in any combination, and the extractor will parse them correctly. The extractor also supports the use of else tags now. For example:

<!--#if (main[1]=="home") -->
This is the home page
<!--#else -->
This is another page
<!--#end if -->

With these changes, conditional tags in the content extractor should behave just as conditionals do in any C++-style programming language.
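
Unnumbered, nested conditionals can be resolved with a stack, as in this Python sketch. It illustrates the general technique rather than the extractor's implementation: render is a hypothetical name, and evaluating each condition is left to a caller-supplied function.

```python
import re

TAG = re.compile(r"<!--#(if\b.*?|else|end if)\s*-->")


def render(template, evaluate):
    """Resolve nested if/else/end if tags; `evaluate` decides each condition."""
    output = []
    stack = []  # one [condition_result, in_else_branch] entry per open if
    position = 0

    def active():
        # Text is kept only when every enclosing tag is on its live branch.
        return all(result != in_else for result, in_else in stack)

    for match in TAG.finditer(template):
        if active():
            output.append(template[position:match.start()])
        position = match.end()
        keyword = match.group(1)
        if keyword.startswith("if"):
            stack.append([bool(evaluate(keyword[2:].strip())), False])
        elif keyword == "else":
            stack[-1][1] = True
        else:  # end if
            stack.pop()
    output.append(template[position:])
    return "".join(output)
```

Because the pattern allows zero or more spaces before the closing -->, both <!--#end if --> and <!--#end if--> are accepted.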

You can now strip all markup and trim surrounding line breaks from the content that you embed into your templates by adding the word "strip" to the end of your embed tags. For example, any content embedded with the tag <!--#embed main strip --> will have markup removed on the fly by the extractor. This is useful for embedding data into JavaScript or other code, where formatting that content editors add in the HTML content files might disrupt the operation of the code.

The content extractor no longer puts double-quotes around external link targets when writing them into your content files. It does, however, copy any double- or single-quotes that you include in the content.ini setting. For example, setting external link target: '_blank' will cause the extractor to write target='_blank' into your <a> tags. By the way, you can disable this feature without removing the setting from your content.ini file by setting external link target: 0.

Perhaps more importantly, the extractor no longer adds a target attribute to links if a target has already been set. This should improve compatibility with different web browsers by ensuring that only one target setting appears.

Previously, the field marker used to separate fields in HTML content files was always ##. You can now use any field marker by specifying it with the "field marker" setting in content.ini. If you don't set this, the extractor will use ## as the default.

4.1.1

5/28/02
You can now load page definitions from a file other than ../content/pages.txt, if you specify an alternative file using the "read page definitions from" setting in content.ini. Besides offering the ability to customize the name and location of this file, this feature also allows multiple site structures (such as you might create when building a multilingual site) to share the same page definitions. For example, sites driven from /en/scripts/ and /es/scripts could share the same page definitions at ../../content/pages.txt. If you don't include this setting, the extractor will load ../content/pages.txt by default.

Another change intended to support multilingual site development regards the way that the extractor finds and loads external content and template files. The extractor now reads files relative to the directory location that appears in the browser window, rather than the physical location of the content.cgi script. This would allow you, for example, to create English content at /en/content and Spanish content at /es/content, and create symbolic links at /en/scripts and /es/scripts that both point to /scripts. Even though the English and Spanish sites both use the same content.cgi script, content.cgi will load the English or the Spanish content depending on the URL from which it was accessed.

Finally, this version fixes a bug that hung content.cgi in some cases if a conditional tag didn't contain a space between the text and the closing comment tag. For example, <!--#end if--> would hang the extractor, where <!--#end if --> would not. The latter style is preferred, but the content extractor was designed to allow some flexibility in the writing of embed tags, so the former should now also work.

4.1.2

6/4/02
Updated an internal path reference to ensure that the content extractor could read files on IIS servers.

5.0

8/13/02
The most substantial update since version 4.0, this version adds significant new features, new names and locations for many of the files, and a new name for this development tool: Contemplate.

Fundamental changes

All files associated with Contemplate are now located in a single directory called "contemplate" at the root level of your website. This organization should help distinguish the Contemplate files from your own site files. Here's a summary of the renamed files:

contemplate/assembler.cgi was: scripts/content.cgi
contemplate/assembler.ini was: scripts/content.ini
contemplate/assembler_wrapper.php was: scripts/content.php
contemplate/pages.txt was: content/pages.txt
contemplate/reporter/reporter.cgi was: stats/stats.cgi
contemplate/reporter/inspect.cgi was: stats/focus.cgi

Unfortunately, due to this broad reorganization, sites built with Contemplate 4.1.2 or earlier are not directly compatible with Contemplate 5.0 or later. Fortunately, updating the site to work with Contemplate 5.0 will be easier than updating a 3.x site to 4.0. In many cases, you need only search your site files and replace "scripts/content" with "contemplate/assembler".

One larger conversion task, though, is due to the fact that Contemplate no longer adds a default file extension to the file names you provide it. If your dynamic URLs, or the entries in your page definitions file, rely on a default file extension setting in assembler.ini, you'll need to edit these locations to specify the file extension. Contemplate 5.0 requires file extensions because they provide more reliable execution of the builder component and search routines.

Finally, Contemplate is now available in other languages for the first time. To address the performance concerns of Unix webmasters, a PHP version is available, and to address the compatibility needs of Windows webmasters, an ASP version is available. These new ports will behave identically to the original Perl version, except where stated in this documentation.

New functionality

The field and random embed types were enhanced to provide support for XML-based content files. Previously, content files had to be organized using HTML tables and the field marker. Now, you can organize your content using XML tags. Contemplate currently recognizes two XML tags: content and group.

For example, if the following content were saved into a file called sample.xml...

<content name="myname">John Doe</content>

<group name="measurements">
<content name="height">67</content>
<content name="weight">115</content>
</group>

...and you had a template called default.html that contained embed tags called main, height, and weight, you could access the relevant content with the following URL:

../contemplate/assembler.cgi?template=default.html&main=field,sample.xml,myname&height=field,sample.xml,measurements/height&weight=field,sample.xml,measurements/weight

Notice that in the case of standalone content tags, you can access the content using a field embed tag in the same way that you would access content in an HTML content file. In the case of grouped content tags, you can access the content by specifying a "path" to the desired content tag, separating all enclosing group names with slashes. With this technique, you can organize content by nesting it to as many levels as you wish.
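
The path lookup can be illustrated with a short Python sketch using the standard xml.etree module. It assumes the content file is well-formed XML once wrapped in a synthetic root element; lookup is a hypothetical name and is not how Contemplate itself parses the file.

```python
import xml.etree.ElementTree as ET


def lookup(xml_text, path):
    """Follow the group names in `path`, then return the text of the final content tag."""
    # The sample file has several top-level tags, so wrap it in a synthetic root.
    node = ET.fromstring("<root>" + xml_text + "</root>")
    *groups, leaf = path.split("/")
    for group_name in groups:
        node = next((g for g in node.findall("group") if g.get("name") == group_name), None)
        if node is None:
            return None
    match = next((c for c in node.findall("content") if c.get("name") == leaf), None)
    return None if match is None else match.text
```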

The new XML parsing routines take effect when your content file has a .xml extension. You may mix HTML and XML content files freely throughout a project.

A completely new embed type provides another new way to organize content. The form embed type allows you to access default values of an HTML form in a content field. For example, if a field called "125" in the file "employees.html" contained the following content...

<form name=foo>

Name
<input type=text name=name value="George Jones">

Role
<select name="role">
<option value="Developer" selected>Developer</option>
<option value="Project Manager">Project Manager</option>
<option value="Instructional Designer">Instructional Designer</option>
</select>

Skills
<input type=checkbox name=skills value="HTML" checked>HTML 
<input type=checkbox name=skills value="JavaScript">JavaScript 
<input type=checkbox name=skills value="PHP" checked>PHP

</form>

...and you had a template called default.html that contained embed tags called name, role, and skills, you could access the relevant content with the following URL:

../contemplate/assembler.cgi?template=default.html&main=field,employees.html,myname&name=form,employees.html,125,name&role=form,employees.html,125,role&skills=form,employees.html,125,skills

Contemplate can automatically access values from text, radio, checkbox, select, and textarea form elements. When checkbox or select elements contain multiple values, Contemplate returns a comma-delimited list of values.
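
To make the extraction rules concrete, here is a Python sketch built on the standard html.parser module. It is an approximation for illustration (FormDefaults and form_value are hypothetical names), and it treats any input without a type as a text field.

```python
from html.parser import HTMLParser


class FormDefaults(HTMLParser):
    """Collect the default values of form elements, keyed by element name."""

    def __init__(self):
        super().__init__()
        self.values = {}       # element name -> list of default values
        self._select = None    # name of the <select> being read, if any
        self._textarea = None  # name of the <textarea> being read, if any
        self._buffer = []      # text collected inside the current <textarea>

    def _add(self, name, value):
        self.values.setdefault(name, []).append(value or "")

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "input":
            kind = (attr.get("type") or "text").lower()
            if kind == "text" or (kind in ("checkbox", "radio") and "checked" in attr):
                self._add(attr.get("name"), attr.get("value"))
        elif tag == "select":
            self._select = attr.get("name")
        elif tag == "option" and self._select and "selected" in attr:
            self._add(self._select, attr.get("value"))
        elif tag == "textarea":
            self._textarea, self._buffer = attr.get("name"), []

    def handle_data(self, data):
        if self._textarea:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "textarea" and self._textarea:
            self._add(self._textarea, "".join(self._buffer))
            self._textarea = None


def form_value(html, name):
    """Return the default value(s) of one element as a comma-delimited string."""
    parser = FormDefaults()
    parser.feed(html)
    parser.close()
    return ",".join(parser.values.get(name, []))
```

Applied to the sample form above, form_value would return "George Jones" for name, "Developer" for role, and "HTML,PHP" for skills.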

The form embed type can be useful either for setting up database-like data structures, or for enforcing standard types and formats for your content. Developers can create a "shell" HTML form in one content field, and then content editors can duplicate that form and change the default values in each new instance.

And the search embed type now has a completely different meaning than before. Previously, you could specify two strings, and Contemplate would return any text occurring between those two strings. Now, you can use the search type to create standard site search mechanisms. Simply add a search form like this to any page...

<form name=search method=get action=../assembled/search_results.html>
<input type=text name=search_string> 
<input type=submit value=Submit>
</form>

...then, in the URL for the search_results page, set an embed tag to the type "search" with no other arguments. Contemplate will search all your content files for instances of the search string, rank the pages by frequency, and embed the results in the location of the search tag.
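
Ranking by frequency can be pictured with the Python sketch below. It assumes the content files have already been read into a dictionary of page name to text, and that matching is case-insensitive; neither detail is stated in these notes, and rank_pages is a hypothetical name.

```python
def rank_pages(pages, search_string):
    """Return (page, hits) pairs for pages containing the string, most hits first."""
    needle = search_string.lower()
    if not needle:
        return []
    counts = {name: text.lower().count(needle) for name, text in pages.items()}
    hits = [(name, n) for name, n in counts.items() if n > 0]
    # Sort by descending hit count, then by page name for a stable order.
    return sorted(hits, key=lambda item: (-item[1], item[0]))
```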

Picking up where the strip attribute left off, you can now perform multiple search and replace operations on your content as you embed it into your templates. For example, if you write an embed tag like this...

<!--#embed main replace /"/&quot;/ -->

...assembler.cgi will replace all double-quotes with the &quot; entity. You can combine the strip and replace attributes, you can include multiple replace attributes in your embed tag, and you can use regular expressions in your search values. For example, this tag...

<!--#embed main strip replace /"/&quot;/ replace /\s+/ / -->

...will tell assembler.cgi to first strip all markup tags from the content, then replace all double-quotes with HTML entities, then replace all multiple spaces with a single space. The strip operation will always be performed before the replace operations, regardless of its position in the embed tag; but replace operations will be performed in the order in which they appear in the embed tag. If you need to do some replacements before stripping tags, you can use a regular expression to strip tags rather than the strip attribute. For example, the three replace attributes in this tag...

<!--#embed main replace /<p>/\*/ replace /<[^>]*>// replace /"/&quot;/ -->

...will replace all paragraph tags with asterisks, then remove all markup tags, then replace double-quote characters with their HTML-entity equivalent.
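
The ordering rule (strip first, then each replace in the order written) can be sketched in Python as follows. This is a simplified illustration, not assembler.cgi's code: apply_filters is a hypothetical name, the replacement text is inserted literally, and Perl-style escapes or $1 references in the replacement are not handled.

```python
import re

REPLACE_ATTR = re.compile(r"replace\s+/([^/]*)/([^/]*)/")


def apply_filters(content, tag_attributes):
    """Apply 'strip' first (if present), then each replace in order of appearance."""
    replacements = REPLACE_ATTR.findall(tag_attributes)
    remainder = REPLACE_ATTR.sub(" ", tag_attributes)
    if "strip" in remainder.split():
        # Remove markup tags, then trim surrounding line breaks.
        content = re.sub(r"<[^>]*>", "", content).strip("\r\n")
    for pattern, replacement in replacements:
        # A function replacement keeps the replacement text literal.
        content = re.sub(pattern, lambda m, r=replacement: r, content)
    return content
```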

The replace attribute does have a couple of known limitations. Currently, you can't use a forward slash symbol (/) or the string "-->" in search and replace values.

Because of limited usage, the body tag type was removed. If you wish, you can achieve the same results using the field type and the replace attribute:

<!--#embed field replace /<body[^>]*>(.*)<.body>/$1/ -->

Because of limited usage and overlap with the new XML functionality, the tag embed type was also removed.

Miscellaneous fixes

Some navigation elements, titles, and the overall appearance of the Flattener were adjusted to conform to the other Contemplate utilities.

When you use the show_expanded_URLs option, the expanded URL won't include the show_expanded_URLs argument, which was redundant.

The PHP wrapper for assembler.cgi can't pass the show_expanded_URLs flag through to the assembler, and was displaying an error message when it encountered the flag. Now, it will reload the page without the flag, which allows the page to display without errors. Note that the PHP wrapper should no longer be needed for sites on PHP-capable servers, which can use the PHP port of Contemplate. However, we'll leave the wrapper file available in the Perl version to ease migration of older sites.

The PHP wrapper now passes environment variables on to assembler.cgi, so traffic reports will provide more detail. Previously, all requests to "wrapped" pages adopted the IP address, host, and user agent of the PHP server, rather than of the site visitor.

Previously, if your page definitions file contained a page called "abcde" followed by a page called "cde," a request for page cde would bring up page abcde. This has been fixed.

Previously, some of the navigation functions didn't work properly when files were saved with DOS line endings. Assembler.cgi now has a better routine for conforming line endings to ensure cross-platform compatibility.
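
Conforming line endings usually comes down to two substitutions, shown here as a Python sketch (normalize_line_endings is a hypothetical name; the routine in assembler.cgi may differ):

```python
def normalize_line_endings(text):
    """Convert DOS (CRLF) and classic Mac (CR) line endings to Unix (LF)."""
    # CRLF must be handled first so it doesn't become two line breaks.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```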

In some cases, the navigation functions didn't work properly when the page definition file contained extra line breaks between sections. Assembler.cgi now has a better routine for splitting up sections and providing this functionality.