TableExtract by Heinz Repp (c) 5/2000:
Tables from Html to Ansi

What is TableExtract?

Many websites offer information in the form of tables, often generated
from databases. It would be useful to process this information in
one's own spreadsheets, databases or word processor documents. For
example, one internet provider offers me a list of connection times
and charges for each month on an html page; I would like to keep them
in a spreadsheet for later reference. Or a producer of nautical charts
has several pages that list its charts for several areas; I would like
to keep them all in one database and have them sorted or selected by
various criteria. I found that my spreadsheet program can *store* its
files in html format, but cannot import an html table into a
spreadsheet. Database programs offer various import formats, but html
tables are not among them. This is where TableExtract comes in: it
scans one or more html files for tables (within <table> tags), rows
and cells, and writes them to files, one for each table found. Each
table row goes to a separate line, with the cell contents separated by
tab characters, a format that is understood by virtually any database,
spreadsheet or text program.
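
To illustrate the output format described above, here is a minimal
Python sketch that turns html tables into tab-separated text. It is
not TableExtract's own code. The names TableToTsv and tables_as_tsv
are made up for this example, and the sketch assumes tables that are
not nested and that close every <tr>, <td> and <th> tag.

```python
import re
from html.parser import HTMLParser


class TableToTsv(HTMLParser):
    """Collects each <table> as a list of rows, each row a list of cells."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell)
            # Collapse runs of spaces, tabs and newlines; trim both ends.
            text = re.sub(r"[ \t\r\n]+", " ", text).strip(" ")
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def tables_as_tsv(html_text):
    """Returns one string per table: rows on lines, cells split by tabs."""
    parser = TableToTsv()
    parser.feed(html_text)
    parser.close()
    return ["\n".join("\t".join(row) for row in table)
            for table in parser.tables]
```

Each string returned by tables_as_tsv could be written to its own
file and then imported into a spreadsheet as tab-delimited text.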

Usage:

Save the web pages you want to extract tables from to your local disk,
or look up their filenames in the cache directory of your browser.
Then call TablExtr.exe with the name(s) of the html file(s) you want
to scan on the command line. Multiple files are allowed, and the
included executables also accept wildcards (* and ?). For each table
found, one file is created in the directory where the html file
resides. It has the same base filename and the extension t<xx>, where
<xx> is a number counting up from '00'.
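
As an illustration of the naming scheme, the following Python sketch
derives the output name for a given table. The function name
output_name is hypothetical, and the sketch assumes that <xx> is a
two-digit decimal counter; what happens beyond 99 tables is not stated
here.

```python
from pathlib import Path


def output_name(html_path, table_index):
    """Returns the output path for the table with the given 0-based index.

    Assumption: the extension is 't' followed by two decimal digits.
    """
    return Path(html_path).with_suffix(f".t{table_index:02d}")
```

Under these assumptions, the first table found in charts.html would
go to charts.t00 and the second to charts.t01, both in the same
directory as charts.html.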

Technical:

Nested tables are quite common in html files. TableExtract stores
them in separate files, replacing the inner table with its filename
in angle brackets. Tables nested deeper than a defined limit
(currently 8) are ignored; in the filename reported in the innermost
outer table, the number is replaced with dashes (.t--).

As whitespace is insignificant in html files, TableExtract ignores
leading and trailing whitespace (spaces, tabs and newlines) in each
cell and compresses successive whitespace between non-whitespace
characters to a single space.

HTML entities are strings that start with an ampersand and end with a
semicolon, typically standing for characters outside the 7-bit ASCII
character set. TableExtract replaces them with the 8-bit value of the
iso-8859-1 (Latin 1, ANSI) character they represent; unknown, Unicode
or invalid entities are copied unchanged. A very common entity is the
non-breaking space, '&nbsp;' or '&#160;'. Like any other entity, it is
replaced by the ANSI code 160, *NOT*, as you might expect, by a space.
You might want to change this later in the files produced; you can
enter the non-breaking space character in a dialog of any Windows
program by typing '0160' on the numeric keypad while holding down the
<ALT> key.

As Windows uses ANSI as its standard single-byte character set, the
output files are suitable for Windows programs. DOS or OS/2 programs
may need the 8-bit characters converted with tools like OEM2ANSI.
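
The whitespace and entity rules above can be sketched in Python as
follows. This is an illustration, not TableExtract's implementation.
The name clean_cell is made up, and the sketch assumes that only
decimal numeric entities ('&#160;') and named entities ('&nbsp;') are
recognised; anything else, including hexadecimal forms, is copied
unchanged.

```python
import re
from html.entities import name2codepoint

# Matches '&name;' and decimal '&#123;' entities only (an assumption).
_ENTITY = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def _replace_entity(match):
    body = match.group(1)
    if body.startswith("#"):
        code = int(body[1:])
    else:
        code = name2codepoint.get(body)
    # Unknown entities and those outside ISO-8859-1 are left as written.
    if code is None or not 0 < code <= 255:
        return match.group(0)
    return chr(code)


def clean_cell(text):
    """Normalises whitespace, then decodes Latin-1 entities in one cell."""
    # Collapse spaces, tabs and newlines first, so that a decoded
    # non-breaking space (code 160) is never merged or trimmed.
    text = re.sub(r"[ \t\r\n]+", " ", text).strip(" ")
    return _ENTITY.sub(_replace_entity, text)
```

Writing the result with the "latin-1" encoding gives one byte per
character. To turn non-breaking spaces into ordinary spaces
afterwards, a call such as clean_cell(text).replace("\xa0", " ") would
do.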

License:

This copyrighted software is provided 'as is' without any warranty
whatsoever. You may use it free of charge for whatever you want
completely at your own risk. You may give it to others provided the
files are left unchanged and this file is included.

Please report bugs and suggestions to:
Heinz.Repp@online.de
