Chapter 5. Conversion specifications

This chapter describes how the command-line version converts Word documents to HTML.

5.1 Original documents

Only docx files can be converted. Files in the doc format saved by older versions of Microsoft Word are not converted.

5.2 Version of destination HTML

By default, tags that conform to the HTML specifications are output.

HTML specification reference

If you specify the "-xhtml" parameter as a conversion option, XHTML 1.0 compliant tags are output.

The tag samples in the following conversion specifications show the output that conforms to the HTML specifications.

5.3 Root, head and meta-information

Conversion source

Conversion destination (HTML tag)

Remarks

Root

<!DOCTYPE html>

<html lang="">

Japanese ver.: lang="ja"

English ver.: lang="en"

See Note 1 for language judgment.

The language can also be specified by a parameter in the conversion options.

Character encoding

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

UTF-8 is the basic format. In addition, Shift_JIS and UTF-16 can be specified as conversion option parameters.

Info: Title

<head>

<title>~</title>

</head>

The title is taken from the "Title" property on the Word "Info" tab.

Meta-information

<head>

<meta name="author" content="">

<meta name="description" content="">

<meta name="keywords" content="">

Converts the property items in the Word "Info" tab to name attribute values and their settings to content attribute values. The name attribute values correspond to the Word properties as follows:

author: Author
description: Comment
keywords: Tag

CSS link

<link href="xxx.css" rel="stylesheet" type="text/css" media="print">

xxx.css is the specified CSS file name. The media attribute is optional.

Default style

<head>

<style>CSS style</style>

</head>

Sets the default CSS to be applied to the entire HTML. The two settings are as follows:

(1) Paragraph text alignment (see 5.9)

(2) border attribute of table, vertical position of td/th (vertical-align).

However, the default style is not output when external CSS is linked.

JavaScript specification

<head>

<script src="xx/yy.js"></script>

</head>

xx/yy.js is the path to the JavaScript file.

Note 1 Language judgement

The language is estimated from the percentage of full-width characters in the Word document and the language setting of the default style. Note that the estimate may be incorrect.

In such cases, the language can be specified in the conversion options at command line execution. See the "-lang" parameter in the table of conversion options for details.
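The full-width character ratio mentioned in Note 1 can be illustrated with a minimal Python sketch. This is not the converter's actual logic: the function names, the 0.3 threshold, and the decision to ignore the default style language setting are all assumptions made for illustration.

```python
import unicodedata


def fullwidth_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that are East Asian wide or full-width."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    wide = sum(1 for c in chars if unicodedata.east_asian_width(c) in ("W", "F"))
    return wide / len(chars)


def guess_lang(text: str, threshold: float = 0.3) -> str:
    """Guess "ja" or "en" from the ratio; the threshold is a hypothetical value."""
    return "ja" if fullwidth_ratio(text) >= threshold else "en"
```

A mixed-language document can fall on either side of any such threshold, which is why specifying the language explicitly with the "-lang" parameter is the safer choice when the estimate is wrong.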

5.4 Block elements

Conversion source

Conversion destination (HTML tag)

Remarks

Body text

<body>-</body>

Title style

When outline level 1 is set for the title style.

<h1>-</h1>

Some of the title styles registered in Word's Style Gallery have outline level 1 set, while others do not.

When the title style does not have an outline level set.

<p>-</p>

Paragraph

<p>content</p>

By default, lines with only line breaks are ignored.

If the "-emptyP" parameter is specified in the conversion options, lines with only line breaks are output as empty <p></p>.

Forced line break

<br >

Forced page break and column break

Ignored.

Section

When an <h> start tag is the first heading, or only <h> tags of a lower rank precede it, a <section> start tag is output before the <h> start tag. When an <h> tag of a higher rank precedes it, a </section> end tag is output.

This creates a tree structure, with a <section> tag before each <h> tag.

If the "-xhtml" parameter is specified in the conversion options, <section> tags are output as <div class="section-area"> tags.
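One possible reading of the section rule is sketched below in Python. It assumes that every heading opens a <section>, and that any open section of the same or a deeper heading rank is closed first. The function name and the input format (a list of rank and text pairs) are hypothetical, and the converter's real output may differ in detail.

```python
def wrap_sections(headings):
    """Turn [(rank, text), ...] into a flat list of tags forming a <section> tree."""
    out = []
    open_ranks = []  # ranks of the sections that are currently open
    for rank, text in headings:
        # Close sections whose heading has the same or a deeper rank.
        while open_ranks and open_ranks[-1] >= rank:
            out.append("</section>")
            open_ranks.pop()
        out.append("<section>")
        open_ranks.append(rank)
        out.append(f"<h{rank}>{text}</h{rank}>")
    # Close whatever is still open at the end of the document.
    out.extend("</section>" for _ in open_ranks)
    return out
```

For the headings h1, h2, h2, h1 this produces two top-level sections, the first of which contains two nested sections.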

5.4.1 Heading styles and outline levels

Conversion source

Conversion destination (HTML tag)

Remarks

Heading 1 to Heading 6 (Heading style)

<h1>-<h6>

The outline level of the heading style determines the rank of the heading tag.

Heading 7 to Heading 9 (Heading style)

<p class="l7">~

<p class="l9">

Heading style outline levels 7 to 9 are set as the class attribute of a paragraph.

Paragraph outline levels 1 to 6

<h1>-<h6>

The outline level of the paragraph determines the rank of the heading tag.

Paragraph outline levels 7 to 9

<p class="l7">~

<p class="l9">

Paragraph outline levels 7 to 9 are set as the class attribute of a paragraph.
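The mapping in the table above can be summarized in a short Python sketch. The function name is hypothetical; the tag and class names follow the table (h1 to h6 for levels 1 to 6, and a paragraph with class l7 to l9 for levels 7 to 9).

```python
def start_tag_for_outline_level(level: int) -> str:
    """Return the start tag that corresponds to a Word outline level (1 to 9)."""
    if 1 <= level <= 6:
        return f"<h{level}>"
    if 7 <= level <= 9:
        # HTML has no h7 to h9, so the level is carried in the class attribute.
        return f'<p class="l{level}">'
    raise ValueError(f"outline level must be 1 to 9, got {level}")
```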

5.4.2 Heading outline numbers

When an outline number is added to a paragraph for which a heading style is specified, the outline number is enclosed in a <span> tag with the class attribute value "number" and output as part of the content string of the <h> tag. If a space separates the outline number from the heading text, the space is output as a single-byte space; if a tab separates them, the tab is deleted and a single-byte space is inserted instead.
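As a rough illustration of this rule, the following Python sketch builds a heading from an outline number, the separator found in Word, and the heading text. The function name and arguments are hypothetical. Placing the space after the closing </span> tag, and emitting no space when Word has no separator, are assumptions.

```python
import html


def heading_with_number(rank: int, number: str, separator: str, text: str) -> str:
    """Build an <h> tag whose outline number is wrapped in <span class="number">."""
    # A space or a tab in Word both become one single-byte space.
    sep = " " if separator else ""
    return (
        f'<h{rank}><span class="number">{html.escape(number)}</span>'
        f"{sep}{html.escape(text)}</h{rank}>"
    )
```

For example, heading_with_number(2, "1.1", "\t", "Overview") returns <h2><span class="number">1.1</span> Overview</h2>.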

5.4.3 Lists

Paragraphs formatted as Word bulleted lists are converted to HTML unordered lists (<ul>/<li>). The bullet symbols in the Word paragraphs are removed.

5.4.4 Paragraph numbering and ordered lists

Paragraphs that have been numbered at the beginning of a paragraph using Word's paragraph numbering feature (numbered paragraphs) are converted as follows:

  1. When a numbered paragraph is both preceded and followed by an unnumbered paragraph or a line break, the numbered paragraph is output as an HTML paragraph (<p> tag). In this case, the paragraph number is enclosed in a <span> tag with the class attribute value "number" and output as normal text.
  2. When two or more numbered paragraphs are consecutive, they are output as an HTML ordered list (<ol>/<li> tags):
       1. If numbered paragraphs are arranged in a hierarchy, adjacent paragraphs are considered consecutive even if they are at different levels.
       2. The type of numbering specified in the Word document is set as the value of the class attribute of the <ol> tag.

The start number is set in the start attribute when it is 2 or more.
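The grouping rule above can be sketched in Python as follows. This is a simplified illustration, not the converter's implementation: it ignores list hierarchy and the numbering-type class on <ol>, and the input format (pairs of number and text, with None for unnumbered paragraphs), the function name, and the "1." label format are all assumptions.

```python
from itertools import groupby


def convert_paragraphs(paragraphs):
    """paragraphs: list of (number, text); number is None for unnumbered paragraphs."""
    out = []
    # Split the document into runs of numbered and unnumbered paragraphs.
    for numbered, run in groupby(paragraphs, key=lambda p: p[0] is not None):
        run = list(run)
        if not numbered:
            out.extend(f"<p>{text}</p>" for _, text in run)
        elif len(run) == 1:
            # An isolated numbered paragraph stays a <p> with the number as text.
            number, text = run[0]
            out.append(f'<p><span class="number">{number}.</span> {text}</p>')
        else:
            # Two or more consecutive numbered paragraphs become an ordered list.
            start = run[0][0]
            attr = f' start="{start}"' if start >= 2 else ""
            out.append(f"<ol{attr}>")
            out.extend(f"<li>{text}</li>" for _, text in run)
            out.append("</ol>")
    return out
```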

5.4.5 Paragraph style name (optional)

By default, paragraph style names are not output.

If you specify the "-pstyle" parameter as a conversion option, the name of the paragraph style is output as the value of the class attribute of the <p> or <h> tag when a paragraph style is specified in a Word paragraph. When paragraph formatting is specified without using the paragraph style feature, the class attribute is not set.

5.5 Figure and figure arrangements

5.5.1 Output folder and file name for illustrations

  1. The illustrations inserted into the docx document are extracted from it, and their paths are set as the value of the src attribute of the <img> tag in HTML. The default folder for extracted illustration files is "image". If the "-fileimages" parameter of the conversion option is specified, a folder named "destination_file_name.images" is created for each output HTML file. The file names are automatically generated with sequential numbers.
  2. For illustrations linked to a docx document, the path of the linked file is set as the value of the src attribute of the <img> tag in HTML. Linked illustration files are not copied or moved. The illustration paths are converted to paths relative to the output HTML file. If the original docx document and the folder of the linked illustrations have been moved, the relative path may not be set correctly.
    Note that if the "-embedimg" parameter is specified, the images will be embedded in the HTML file.
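The conversion of a linked illustration path to a path relative to the output HTML file can be illustrated with Python's standard library. The function name and the example paths are hypothetical, and POSIX-style paths are assumed so that the result does not depend on the operating system.

```python
import posixpath


def img_src(linked_image_path: str, output_html_path: str) -> str:
    """Return the image path relative to the folder that holds the output HTML file."""
    html_dir = posixpath.dirname(output_html_path)
    return posixpath.relpath(linked_image_path, start=html_dir)
```

With an image at /docs/project/figures/chart.png and an output file at /docs/project/out/index.html, the sketch returns ../figures/chart.png. If the image folder is later moved, a relative path computed this way no longer resolves, which is the situation described in item 2.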

5.5.2 Image and shape formats

By default, images are converted to PNG or JPEG format, and AutoShape, line shapes inserted in Word, and shape files in EMF and WMF formats are converted to SVG format for output.

If you specify the “-throughimg” parameter in the conversion option, images and shapes inserted into Word in GIF, EMF or WMF formats are saved to the illustration output folder in their original formats without file format conversion.

5.5.3 Layout Options

The layout option specified in Word is output as the class attribute of the <img> tag.

Conversion source

Options

class attribute

In Line with Text

class="inline"

With Text Wrapping

Common for “With Text Wrapping”

class="block"

Square

class="block square"

Tight

class="block tight"

Through

class="block through"

Top and Bottom

class="block top-bottom"

Behind Text

class="block behind"

In Front of Text

class="block front"

Notice

In CSS, the display property specifies whether the figure layout is inline or block. Since the default value of the display property is inline, even if you set “With Text Wrapping” in the Layout Options in Word, it may be displayed as “In Line with Text” in the browser. In such a case, specify as follows in CSS:

img.block {
  display: block;
}

5.5.4 Position to output the figure with “With Text Wrapping” specified

For an illustration with text wrapping specified, the <img> tag is output after the end tag of the block (heading or paragraph) that holds its anchor. In bulleted items, however, it is output just before the end tag. For details, refer to “6.4 Layout of shapes”.

5.5.5 Alternative text for figures

The alt attribute is output in the <img> tag in HTML. Its value is the string entered as the alternative text for the figure inserted in the Word document. If no string is set, "Please enter alt text." is output.

5.6 Formula

Formulas edited in Word's formula editor are output as SVG format files using <img> tags by default.

Depending on the conversion option parameters, you can convert formulas to external files in MathML format, convert them to MathML markup, or output them as Office Math markup, which is Word's own representation of formulas in Office Open XML.

Parameter

Output format

Unspecified

Output formulas to <img> tags as SVG format files.

-math

Output formulas to <img> tags as MathML format files.

-xmath

Output formulas as MathML markup.

-omath

Output formulas in Word's own Office Math format.

5.7 Tables

Conversion source

HTML element

Example

Table

<table>

<tbody>

<tr>

<td>

The name of the table style selected under "Table Styles" on the "Table Design" tab of the Word ribbon is output as the class attribute of the <table> tag.

Style names that contain characters other than single-byte alphanumeric characters and some single-byte symbols are not output as the value of the class attribute.

Merge

Cell merge

<td colspan="n">

“n” is the number of horizontally merged cells.

Row merge

<td rowspan="n">

“n” is the number of vertically merged cells.

5.7.1 Table header row

To output the table header tag (table header: thead), set either of the following in the first row of the table.

  1. Select the first row of the table and turn on "Repeat Header Rows" in "Table Tools: Layout" on the Word ribbon.
  2. Check only "Header Row" in "Table Style Options" in "Table Tools: Table Design" on the Word ribbon.

Conversion source

HTML element

Description

“Table Tools: Layout”

“Table Tools: Table Design”: “Table Style Options”

<thead><tr><td>…</td></tr></thead>

The first row of the table is enclosed with <thead>.

If you turn on “Repeat Header Rows”, the header rows are repeated on each page when the table spans pages. To avoid this, turn off "Repeat Header Rows" and check "Header Row" in “Table Style Options” in “Table Design”.

5.7.2 Table header column

Select the first column of the table and check only "First Column" in "Table Style Options" in "Table Tools: Table Design" on the Word ribbon to set the cell of the first column as the header cell.

Conversion source

HTML element

Description

“Table Tools: Table Design”: “Table Style Options”

<tr><th>…</th></tr>

The cells in the first column of the table are marked up with the header cell tags.

5.7.3 Cell alignment

When the alignment in a cell is specified in "Alignment" of the Word ribbon "Table Tools: Layout" or in the table style property cell, the class attribute is output to the <td>/<th> tag for the vertical alignment, and the style is defined in the <head> of the HTML with the <style> tag. However, if external CSS is linked or "-defstyle" is specified in the conversion option, the style definition is not output.

Conversion source

HTML element

Description

“Table Tools: Layout”: Alignment options

Output in <head>

<style>html{text-align:justify;}table,td,th{border:solid 1px;}td,th{vertical-align:top;}td.center,th.center{vertical-align:middle;}td.bottom,th.bottom{vertical-align:bottom;}</style>

The relevant styles in the source code above are the vertical-align rules for td and th.

Align Top

No output due to default value.

Align Center (vertical)

<td class="center">/<th class="center">

Align Bottom

<td class="bottom">/<th class="bottom">

Tip

The horizontal alignment is output as a class attribute in the paragraph <p> tag within the <td>/<th> tag.

Justified: class="start"
Center: class="center"
Right: class="end"

5.8 Inline elements

5.8.1 Font group

Font group

HTML element

Example

Bold

strong

If the "-hstrong" parameter is specified in the conversion options, the bold set in the heading style is ignored.

Italic

Ignored by default. Depending on the conversion options, it is output with the <i> tag or with the following CSS style specification:

<span style="font-style:italic;">

Underline

Ignored by default. Depending on the conversion options, it is output with the <u> tag or with the following CSS style specification:

<span style="text-decoration-line:underline;">

Note that the anchor text of the link is not underlined.

Strikethrough

Ignored by default. Depending on the conversion options, it is output with the <del> tag or with the following CSS style specification:

<span style="text-decoration-line:line-through;">

Subscript

sub

Superscript

sup

Text Effects and Typography

Ignored.

Text Highlight Color

Ignored.

Font Color

Ignored by default. Depending on the conversion options, it is output with the following CSS style specification:

<span style="color:color value;">

<span style="color:red;">text color red</span>, <span style="color:#00B050;">text color green</span>

Character Shading

Ignored.

Enclose Characters

Ignored.

Font

Ignored.

Font Size

Ignored.

Case

Ignored.

Phonetic Guide

ruby rp rt

<ruby>紫陽花<rt>あじさい</ruby>

<ruby>漢<rp>(</rp><rt>かん</rt><rp>)</rp>字<rp>(</rp><rt>じ</rt><rp>)</rp></ruby>

Character Border

Ignored.

5.8.2 Links and cross-references

References

HTML element

Example

Link (external URL)

<a href="Link URL">label</a>

“Link” on the “Insert” tab on the ribbon.

Link (id)

<a href="#id value">label</a>

Cross-reference

<a href="#id value">label</a>

References in Word documents by "Cross-references" in the "References" tab on the ribbon.

<span id="">

id value

<span id="id value"></span>

Link to bookmark "here"

5.9 Paragraph text alignment

The paragraph alignment of the “Normal” style, set in the style gallery on the “Home” tab of the Microsoft Word ribbon, is output to the <style> element in <head>. However, when the "Normal" style is left-aligned, nothing is output, because text-align:start is already the CSS default and the alignment does not need to be specified.

Note that <style> in <head> is not output if “-defstyle” parameter is specified in the conversion option (see "3.2 Conversion options").

Paragraph alignment

Elements and class attributes

Example

Alignment of "Normal" style

Align Left

No settings.

<style></style>

Center

text-align:center

<style>html{text-align:center;}</style>

Align Right

text-align:end

<style>html{text-align:end;}</style>

Justify

text-align:justify

<style>html{text-align:justify;}</style>

Distributed

text-align:justify;

text-justify:auto;

<style>html{text-align:justify;text-justify:auto;}</style>

If you use the "Paragraph" group on the "Home" tab of the ribbon to specify a paragraph alignment that differs from that of the “Normal” style, the following class attributes are set in the heading rank tag (h1 to h6) or p tag.

Paragraph alignment

Elements and class attributes

Example

Align Left

class="start"

<p class="start">…</p>

Center

class="center"

<p class="center">…</p>

Align Right

class="end"

<p class="end">…</p>

Justify

class="justify"

<p>…</p>

Distributed

class="distribute"

<p class="distribute">…</p>

5.10 Text Box

- The contents of a text box without a border are converted as if the text box did not exist.

- Text boxes with borders are converted to line art (SVG image) and the file name is output in the src attribute of img.

5.11 Endnote

An anchor tag is set on the endnote symbol that marks the location of the endnote in the body text, and the href attribute of the anchor tag refers to the id of the endnote.

The text of the endnotes is output at the end of the document, at the same level as the last paragraphs of the body. Each endnote is given id="endnote-n", where n is the endnote number.

5.12 Table of contents output

A table of contents created using Word's table of contents function is output to the HTML file, with each item linked to the corresponding heading. The table of contents is output as follows:

Notice

  • Only tables of contents created from paragraphs with outline levels set are supported.
  • If there are multiple tables of contents, only one is treated as the table of contents. A table of contents created with the "Built-In" feature of Word's table of contents function is given priority. Otherwise, the first table of contents that appears in the document is used.

  • Table of contents for charts and tables is excluded.

HTML element

Description

① <a id="mobile-side-btn" href="javascript:;"><span class="mobile-side-btn-icon" id="mobile-side-btn-icon"></span></a>

The <a> element immediately before the <nav> tag can be used as a button to control the display of the table of contents on mobile devices.

Please refer to the following web page for a sample of the buttons for mobile devices.

https://www.antennahouse.com/html-on-word-samples

② <nav class="toc-wrap">

The table of contents sections ④ and ⑤ are output enclosed in the ② <nav> and ③ <div> tags.

If both the "-split" and "-tocout" parameters are specified in the conversion options, ③ to ⑤ are output as a separate HTML file, "inc-toc.html".

③ <div id="toc">

④ <p class="toc-heading">[Table of contents heading]</p>

The paragraph style name (blanks are converted to "-") set for the paragraph of the table of contents heading is output as the class attribute.

For a table of contents inserted using Word's "Built-In" table of contents function, <p class="toc-heading"> is output by default.

⑤ <p class="toc-[n]"><a href="[Link to the corresponding heading id]">[Heading name]</a></p>

The paragraph style name (blanks are converted to "-") set for the paragraph of each item in the table of contents is output as the class attribute.

For a table of contents inserted using Word's "Built-In" table of contents function, <p class="toc-[n]"> is output by default. ([n] is a number from 1 to 6.)

The link to the corresponding heading id is output as a URL starting with "#_Toc".

If the HTML file is split by specifying the "-split 1|2|3" parameter in the conversion options, the link consists of the file name of the split HTML file followed by the id. (e.g. index-1.html#_TocXXX)

5.12.1 Table of contents for split output

If the "-split 1|2|3" parameter is specified in the conversion options and the output HTML file is split according to the Word outline level, the table of contents section will be output as follows:

Specified parameter

Output

Note

Only -split 1|2|3

Items ① to ⑤ of the table in "5.12 Table of contents output" are output immediately after the <body> tag in every split HTML file.

In this case, "active" is output as the class attribute of the <p> tag of the table of contents item that corresponds to the HTML file itself (the highest hierarchical level in the page).

-split 1|2|3 -tocout

Items ③ to ⑤ of the table in "5.12 Table of contents output" are output as a separate HTML file (inc-toc.html).

In addition, ① and ② are output immediately after the <body> tag in every split HTML file.

inc-toc.html can be loaded into the split HTML files using JavaScript, or loaded into other HTML files.

For this reason, inc-toc.html contains no tags other than ③ to ⑤, such as <html>, <head>, or <body>.

Please refer to the following web page for an example of loading a table of contents section using JavaScript.

https://www.antennahouse.com/html-on-word-samples

5.13 Split output

When the "-split 1|2|3" parameter is specified in the conversion options, the HTML file will be split and output according to the outline level of the paragraphs specified in the Word document. The outline levels that can be specified are 1 to 3.

The contents of the splitting are as follows:

Item

Content

Note

Splitting point

The file is split immediately before each paragraph whose Word outline level equals the value that follows "-split".

If the value is 2 or 3, the file is also split immediately before each paragraph of a higher level.

Output file name

Split output files are named by inserting "-" (hyphen) and a sequential number before the extension (.html) of the specified file name. The first page uses the specified output file name as is.

Example of specifying index.html as the output file name.

index.html, index-1.html, index-2.html, index-3.html, …

Output HTML

<html>, <meta>, <style>, <link> (CSS), <script> (JavaScript) and <body> tags are common to all pages.

The <title> tag is set to [outline level 1 label] - [outline level 2 label] - [outline level 3 label] - [title set in the Word document information] for the relevant page.

Labels below the specified outline level will not be output within the <title> tag.

e.g. -split 1 is specified

[outline level 1 label] - [title set in Word document information].

Table of contents

The table of contents is output at the top of all split HTML files (immediately after the <body> tag).

If the "-tocout" parameter is specified at the same time, <div id="toc"></div> in the table of contents is output as a separate HTML file (inc-toc.html).

For details, please refer to "5.12.1 Table of contents for split output".

Page link

If the "-pagenavi" parameter is specified together with the "-split 1|2|3" parameter in the conversion options, links to the previous and next pages of the displayed HTML file are output.

See "5.14 Page link output" for details.
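The file naming and <title> rules in the table above can be sketched in Python as follows. The function names and arguments are hypothetical. The sketch assumes that the sequential numbers start at 1 for the second page and that the title parts are joined with " - ", as shown in the examples in the table.

```python
import os


def split_file_names(output_name: str, page_count: int) -> list:
    """Return the names of the split files; the first page keeps the given name."""
    stem, ext = os.path.splitext(output_name)
    return [output_name] + [f"{stem}-{i}{ext}" for i in range(1, page_count)]


def page_title(labels: list, doc_title: str, split_level: int) -> str:
    """Join the outline labels down to the split level with the document title.

    labels holds the page's outline labels, level 1 first; labels below the
    level given to -split are left out of the title.
    """
    parts = [label for label in labels[:split_level] if label]
    return " - ".join(parts + [doc_title])
```

For example, split_file_names("index.html", 4) yields index.html, index-1.html, index-2.html and index-3.html, matching the example in the table.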

5.14 Page link output

When the "-split 1|2|3" parameter is specified in the conversion options and the "-pagenavi" parameter is specified, links are output at the top (immediately after the table of contents, if any) and bottom (immediately before the </body> tag) of the split HTML file, based on the sequential number of the HTML file name to be output.

The link labels can be output in Japanese or English by specifying the value following the parameter:

Value

Link label

Note

ja

"前へ" and "次へ" in Japanese.

If there is no previous or next page, "前へ" or "次へ" links are not output.

Any value other than "ja", or no value

“Prev” and “Next” in English.

If there is no previous or next page, "Prev" or "Next" links are not output.

5.14.1 Output HTML elements

If the value following the "-pagenavi" parameter is omitted or is anything other than "ja", the output is as follows. (The example shows the HTML source code of index-1.html, one of the split HTML files, when the output file name is index.html.)

Tags output at the top

<nav>
<div class="pagenavi-wrap-top">
<div class="pagenavi-prev">
<a href="index.html">Prev</a></div>
<div class="pagenavi-next">
<a href="index-2.html">Next</a></div>
</div>
</nav>

Tags output at the bottom

<nav>
<div class="pagenavi-wrap-bottom">
<div class="pagenavi-prev">
<a href="index.html">Prev</a></div>
<div class="pagenavi-next">
<a href="index-2.html">Next</a></div>
</div></nav>
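The markup above can be reproduced with a short Python sketch, shown here only to make the rules explicit. The function name and arguments are hypothetical. The source states that a link is not output when there is no previous or next page; omitting the enclosing <div> as well is an assumption.

```python
def pagenavi(file_names, index, position="top", lang=None):
    """Build the page link block for file_names[index]; position is "top" or "bottom"."""
    prev_label, next_label = ("前へ", "次へ") if lang == "ja" else ("Prev", "Next")
    lines = ["<nav>", f'<div class="pagenavi-wrap-{position}">']
    if index > 0:  # a previous page exists
        lines += [
            '<div class="pagenavi-prev">',
            f'<a href="{file_names[index - 1]}">{prev_label}</a></div>',
        ]
    if index < len(file_names) - 1:  # a next page exists
        lines += [
            '<div class="pagenavi-next">',
            f'<a href="{file_names[index + 1]}">{next_label}</a></div>',
        ]
    lines += ["</div>", "</nav>"]
    return "\n".join(lines)
```

Calling pagenavi(["index.html", "index-1.html", "index-2.html"], 1) returns the same tags as the "Tags output at the top" example for index-1.html.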