Word’s Filtered HTML is not like pure filtered water. It’s better than unfiltered HTML, in that Word-specific information is discarded, but there’s a lot of icky stuff floating around in the pot that can drag down your efforts to create a clean and valid ePub. Attention to detail will prevent it from blowing up in your face.
After saving the original Word document as Filtered HTML with UTF-8 encoding, there’s always some simplification and cleanup to do before polishing up the formatting for a beautiful ePub. Don’t skip this step if you want the final ePub to be easy to adapt to several publishing outlets, such as Amazon, Barnes & Noble, and the iBookstore.
For cleaning up Word HTML, use a text editor that supports UTF-8, search and replace, and preferably, regular expressions. I like EditPad Pro or UltraEdit. Sigil is a great ePub editor, and it’s also well-suited to the task of cleaning up HTML and CSS. You can, indeed, open your HTML document in Sigil, clean it up, fine-tune the format, validate, and save as ePub, start to finish, without using another program.
ePub needs an XHTML doctype
ePub uses XHTML 1.1, so the first thing to consider is replacing the opening <html> and <head> lines in the converted HTML file with an XML declaration, a proper doctype, and an <html> tag that carries the XHTML namespace. If you plan to use Sigil as your ePub editor and conversion utility, you may simply wait and let Sigil do it. The end result will be the same.
Replace these lines:
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<meta name=Generator content="Microsoft Word 14 (filtered)">
With the following lines:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
Note that the XML declaration includes the encoding standard, utf-8. You can safely jettison both meta tags from Word’s filtered HTML output. The charset metadata is redundant and nobody cares about Word being the generator.
Likewise, if Word included a “lang” attribute in the <body> tag, go ahead and delete it. Your ePub file package will include a metadata file named “content.opf” (or whatever.opf, doesn’t matter) in which you specify the language of the book, so the information doesn’t need to be repeated in each file’s body tag.
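If you would rather script this step than edit by hand, the following is a minimal Python sketch of the same replacement. The function names replace_prologue and strip_body_lang are hypothetical, and the patterns assume that Word's output opens with <html>, <head>, and a run of <meta> tags in that order; check the result against your own file before trusting it.

```python
import re

# The XHTML 1.1 opening lines described above.
XHTML_PROLOGUE = (
    '<?xml version="1.0" encoding="utf-8" standalone="no"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"\n'
    '  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    '<head>\n'
)

# Assumption: the file starts with <html>, <head>, then zero or more <meta> tags.
WORD_HEAD_OPEN = re.compile(
    r'<html[^>]*>\s*<head>\s*(?:<meta\b[^>]*>\s*)*', re.IGNORECASE
)

# Matches a lang attribute (quoted or unquoted) inside the <body> tag.
BODY_LANG = re.compile(
    r"""(<body\b[^>]*?)\s+lang=(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE
)


def replace_prologue(html):
    """Swap Word's opening tags and meta tags for the XHTML 1.1 prologue."""
    return WORD_HEAD_OPEN.sub(lambda m: XHTML_PROLOGUE, html, count=1)


def strip_body_lang(html):
    """Drop the lang attribute from the <body> tag, keeping other attributes."""
    return BODY_LANG.sub(r'\1', html, count=1)
```

Run both functions on the text of the file after reading it as UTF-8, and write the result to a new file so the original filtered HTML stays available for reference.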
Remove Word’s Font Definitions
Following the <head> tag, Word’s filtered HTML includes a <style></style> tag containing @font-face definitions, style definitions, and assorted oddments from the conversion. Word is just trying to be helpful and will include most of your installed Windows fonts in a long list, a very, very long list if you have many fonts.
Delete the entire font list between this line:
/* Font Definitions */
And this line:
/* Style Definitions */
The stuff in-between these lines is recognizable as your list of Windows installed fonts:
@font-face
{font-family:Courier;
panose-1:2 7 4 9 2 2 5 2 4 4;}
[…and on and on for miles, until you get to the end, just above Style Definitions]
@font-face
{font-family:ZWAdobeF;
panose-1:0 0 0 0 0 0 0 0 0 0;}
/* Style Definitions */
Many eReaders do not yet support embedded fonts and even if they did, you wouldn’t want to keep that huge list of @font-face declarations. While it is possible to use embedded fonts and Sigil makes it fairly easy to do, I would not make that leap unless I had a compelling reason. In any case you can safely delete all the @font-face definitions that Word inserts. If you decide to embed a font or two anyway, say for Chapter headings or the title page, include CSS in your style sheet to specify the font, along with a generic alternative for eReaders that will ignore it.
Don’t be afraid to also delete the “panose” information. The PANOSE system is a way of classifying fonts according to their letterform characteristics. You don’t need this information in your HTML file for ePub production.
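For a long font list, deleting by hand gets tedious. Here is a minimal Python sketch that cuts everything between the two comment markers, assuming both markers appear exactly as Word writes them. The function name strip_font_definitions is made up for this example.

```python
FONT_MARKER = "/* Font Definitions */"
STYLE_MARKER = "/* Style Definitions */"


def strip_font_definitions(html):
    """Remove the @font-face list that sits between Word's two comment markers.

    The Style Definitions marker and everything after it are kept.
    If either marker is missing, the text is returned unchanged.
    """
    start = html.find(FONT_MARKER)
    end = html.find(STYLE_MARKER)
    if start == -1 or end == -1 or end < start:
        return html
    return html[:start] + html[end:]
```

Slicing on the literal markers is used here instead of a regular expression so that nothing outside the font list can be matched by accident.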
Clean up Word Styles
Next, look through the Style Definitions. If you have a clean Word doc, there won’t be many. If you’re working on someone else’s book file, there may be a wagonload. I spend a significant amount of time here amid the styles to avoid problems later. Look at what’s there, view the document in a browser, and then set to work to simplify Word’s styles.
You may want to keep a few simplified Word styles or, as I do, simply replace all of them with a shorter list of what is necessary to achieve the author’s intended structural scaffolding without being overly fussy. Do keep a copy of the original filtered HTML output for reference.
Remember that eReader devices and Apps have their own default fonts, margins, and restrictions. Screen size varies from tablets to smartphones. The reader may change the font, font-size, screen colors, and the orientation of the eReader (vertical or horizontal). Ergo, there is simply no need for most of Word’s precision styles.
Start the style clean-up with the most prevalent style, MsoNormal. Word makes it look more complex than necessary by repeating the same declarations for several elements:
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
text-indent:.5in;
text-autospace:none;
font-size:12.0pt;
font-family:"Times New Roman","serif";}
If you want to keep the MsoNormal class, delete everything but the text-indent and make that a percentage so it works on all eReader screen sizes. Keep margin-top if you’re writing non-fiction with block-style paragraphs or want a little space between each paragraph. If you are using indented paragraphs, a blank line between paragraphs is unnecessary and can look amateurish.
Word expresses font sizes in points and dimensions in inches, but these absolute units are more suitable to print media than to eBooks. Stick to ems or pixels for font sizes and, where possible, percentages for widths and heights, so the eBook adapts better to various screen sizes and resolutions.
Here’s our final “MsoNormal” style:
.MsoNormal {margin:0; text-indent: 10%;}
I prefer to replace .MsoNormal with a simple paragraph style. List-items and divs are styled differently than paragraphs, so it’s pointless to bundle them into the same CSS selector:
p {margin: 0; padding: 0; text-indent: 10%;}
You may delete all of the MsoNormal class attributes from the document without consequence. For example, thousands of paragraph tags like this: <p class=MsoNormal>
can become unadorned paragraph tags like this: <p>
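A plain search and replace handles this, but if you want to script it, here is a small Python sketch. It is deliberately conservative: it only touches paragraph tags whose sole attribute is the MsoNormal class, and leaves tags that carry extra attributes (such as an inline style) for you to review. The function name strip_mso_normal is hypothetical.

```python
import re

# Matches <p class=MsoNormal>, with or without quotes around the class name,
# but only when the tag has no other attributes.
MSO_NORMAL_P = re.compile(r'''<p\s+class=(["']?)MsoNormal\1\s*>''', re.IGNORECASE)


def strip_mso_normal(html):
    """Turn <p class=MsoNormal> into a bare <p> tag."""
    return MSO_NORMAL_P.sub('<p>', html)
```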
Clean up Header Styles
Word’s CSS declarations for header tags should be modified for eBooks. They contain some unnecessary properties but need the addition of one or two missing ones. That is, we don’t need font-family or font-size, but we do need to control spacing above the headers, font-style, and justification.
Here’s some typical Word CSS for h1 and h2:
h1
{mso-style-link:"Heading 1 Char";
margin:0in;
margin-bottom:.0001pt;
text-align:right;
text-autospace:none;
font-size:16.0pt;
font-family:"Times New Roman","serif";
letter-spacing:1.0pt;
font-weight:bold;
font-style:italic;}
h2
{margin:0in;
margin-bottom:.0001pt;
text-align:right;
text-autospace:none;
font-size:22.0pt;
font-family:"Times New Roman","serif";
text-transform:uppercase;
letter-spacing:1.0pt;
font-weight:bold;
font-style:italic;}
In practice, we need something more like the CSS declarations below, with a few redundant styles for dumb eReader apps. Some eReaders don’t support font-size, font-family, margin-bottom, letter-spacing, or text-transform.
In the basic styles listed below, I’ve added comments that aren’t in my real style sheet so that they might make more sense:
/* Remove default text-indent that some eReaders add and ensure bold weight */
h1, h2, h3 {text-indent:0; font-weight:bold;}
/* On the title page add a wide top margin and center the header. Some authors like headers in italics, others prefer all-caps or simple Roman style.*/
h1.tp {margin-top:20%; text-align:center; font-style:italic;}
/* Major book Parts might be aligned right, display in italics, and have a large top margin. Margin-right works in ePubs but Kindle is reliable only for margin-top.*/
h1.part {margin-top:30%; text-align:right; margin-right:20px; font-style:italic; page-break-after: always;}
/* In front and back matter, spacing is less dramatic than for chapter headings */
h2 {margin-top:10%; text-align:center; page-break-after: avoid;}
h2.chap {margin-top:25%; text-align:right; font-style:italic;}
If you create a separate ePub for Kindle conversion and don’t split it into separate files, add a CSS class for .pagebreak and insert it BEFORE (not inside!) your h1 or h2 headers and front-matter sections.
/* For Kindle books, use a simple div with a class of .pagebreak to add page breaks. Do not use the deprecated <br clear="all" /> tag.*/
.pagebreak {page-break-before: always;}
In ePubs for Kindle conversion, usage is:
<div class="pagebreak"/> OR <div class="pagebreak"></div>
<h2>Chapter 3</h2>
NOT
<h2 class="pagebreak">Chapter 3</h2>
You may also combine the pagebreak class with meaningful div names:
<div id="toc" class="pagebreak"></div>
Some people prefer to use .pagebreak {page-break-after:always;}
As long as you know what you’re doing, that’s fine.
Move all Inline Styles to your CSS style sheet
Move inline-styles in the document to the CSS style sheet to reduce file-size and for easier editing. If you don’t, Sigil will do it for you, but then you’ll need to merge duplicate styles or live with some messiness in your style sheet.
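Before moving inline styles, it helps to know which ones are actually in the document and how often each appears. The Python sketch below tallies them; it is only an illustration, the name inline_style_counts is hypothetical, and it assumes style values are quoted (Word's filtered HTML normally uses single quotes).

```python
import re
from collections import Counter

# Matches style="..." or style='...' and captures the value.
STYLE_ATTR = re.compile(r'''\bstyle=(?:"([^"]*)"|'([^']*)')''', re.IGNORECASE)


def inline_style_counts(html):
    """Count each distinct inline style value found in the markup."""
    counts = Counter()
    for match in STYLE_ATTR.finditer(html):
        value = match.group(1) if match.group(1) is not None else match.group(2)
        # Normalize spacing and trailing semicolons so equivalent values group together.
        parts = [part.strip() for part in value.split(';') if part.strip()]
        counts[';'.join(parts)] += 1
    return counts
```

Calling inline_style_counts(html).most_common() lists the most frequent values first; those are the best candidates for a class in the style sheet, while the one-offs may not be worth keeping at all.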
Don’t worry about a few redundant declarations or “fully loaded” styles to accommodate Kindle formatting, which does not understand multiple classes applied to a single element.
If you do include multiple classes in an element, such as a div or paragraph, Kindle will keep the last one and ignore the rest. For example, if you have a paragraph with the classes center italic (<p class="center italic">), Kindle will make it italic and ignore the centering. The workaround for this inconvenient problem is to load a single class with as many styles as needed to achieve the effect (e.g., .centerItalic {text-align:center; font-style:italic;}) or to use an occasional span class. Surely this idiotic behavior will eventually go away.
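To see where a document relies on more than one class per element, a short Python sketch like the one below can list the combinations in use. Both function names are hypothetical, the camel-case naming scheme is just one possible convention, and the pattern assumes class attributes are written with double quotes.

```python
import re
from collections import Counter

CLASS_ATTR = re.compile(r'\bclass="([^"]*)"')


def multi_class_combinations(html):
    """Count every class attribute that lists more than one class name."""
    combos = Counter()
    for match in CLASS_ATTR.finditer(html):
        names = match.group(1).split()
        if len(names) > 1:
            combos[' '.join(names)] += 1
    return combos


def combined_class_name(names):
    """Suggest a single class name, e.g. 'center italic' -> 'centerItalic'."""
    first, *rest = names.split()
    return first + ''.join(name[:1].upper() + name[1:] for name in rest)
```

Each combination reported is a candidate for one "fully loaded" class in the style sheet for the Kindle version.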
Keep CSS simple and workable
A K.I.S.S. approach to CSS pays off in problem-free and nicer-looking eBooks. There is no “one-size fits all” but only a few styles are truly necessary in every book. Some authors are fussier than others about formatting, so I advise patience and some experimentation to find the best solutions for your work.
My method is to work through the Word styles, evaluating each one, or a simpler variation of it, for inclusion in the style sheet. Some Word styles will have properties that are ignored in eBooks or are repeated in other styles. Don’t keep redundant Word styles and stick to simple CSS that creates a similar, consistent look. For example, I’ve worked with manuscripts that had six slightly different ways to indicate a scene change, but they were all variations of blank lines above and below centered special characters. One consistent style is better for maintenance and as a visual cue. Another manuscript had 15 different styles for H3 when only two were needed. Just keep it simple!
Use search and replace and/or regular expressions for deleting inline styles or inserting classes. Regular expressions are powerful and can select more HTML markup than intended, resulting in lost content when replaced. Save a backup and check the behavior of each “regex” that you use. Then, if an expression turns out to be too greedy, undo the replacements or restore from backup and use regular search and replace or a series of simpler regular expressions that do what you want in several steps. You can learn to write better regular expressions that don’t grab more markup than they should, but it takes practice.
Here are a couple of regexes that may help you with cleanup.
To remove all inline styles and/or classes in paragraph tags, find this:
<p\b[^>]+>
Replace with:<p>
Translation: Find each tag that begins with “<p” followed by a word boundary (so you don’t match “<pre”) and at least one attribute, and replace it with a bare <p> tag, which removes all classes and inline styles.
To remove empty paragraphs, find:
<p> *(?:&nbsp;)* *(?:<b>)? *(?:&nbsp;)* *(?:</b>)? *(?:&nbsp;)* *</p>\r\n
Replace with nothing.
You may want to find empty paragraphs one at a time because they often indicate scene changes or places where more white space is needed. Use margin-top CSS styles to add white space above paragraphs or headers instead of empty paragraphs.
To remove the &nbsp; entities that result when authors hit the spacebar too many times, find:
 *(?:&nbsp;)+ *
Note: There is a space before each asterisk (*).
Replace with: one space (i.e., press the spacebar once)
Why the space? We carefully replace with one space to avoid losing spaces between words, especially after a line break. Follow up by replacing two spaces with one space using a simple search and replace.
Note that \r\n finds Windows-style line breaks. If you use Unix style line breaks, look for \n instead.
I used to routinely remove line breaks within tags in order to spot certain anomalies, but Sigil will “pretty format” your code automatically, so it’s not a required step. If you want to learn more about using regular expressions, look into RegexMagic and RegexBuddy from Just Great Software.
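The cleanup expressions above can also be chained in a script. The following is a minimal Python sketch, not a finished tool: the function name clean_paragraphs is hypothetical, the patterns assume that padding appears as &nbsp; entities (as Word's filtered HTML usually writes it), and \r?\n is used so that both Windows and Unix line breaks match. Note that it removes every empty paragraph in one pass, so use it only after you have dealt with the ones that mark scene changes.

```python
import re

# <p ...> with at least one attribute; \b keeps <pre> from matching.
P_WITH_ATTRS = re.compile(r'<p\b[^>]+>')
# Paragraphs holding only spaces, &nbsp; entities, or an empty <b></b> pair.
EMPTY_P = re.compile(
    r'<p> *(?:&nbsp;)* *(?:<b>)? *(?:&nbsp;)* *(?:</b>)? *(?:&nbsp;)* *</p>\r?\n'
)
# Runs of &nbsp; entities with optional spaces on either side.
NBSP_RUN = re.compile(r' *(?:&nbsp;)+ *')
# Two or more ordinary spaces in a row.
MULTI_SPACE = re.compile(r' {2,}')


def clean_paragraphs(html):
    """Apply the four cleanup steps in order and return the new text."""
    html = P_WITH_ATTRS.sub('<p>', html)   # strip classes and inline styles
    html = EMPTY_P.sub('', html)           # drop empty paragraphs
    html = NBSP_RUN.sub(' ', html)         # collapse &nbsp; runs to one space
    html = MULTI_SPACE.sub(' ', html)      # follow-up: runs of spaces become one
    return html
```

Keeping each pattern in its own named constant makes it easy to test one step at a time on a backup copy, which is the safest way to catch an expression that grabs more than intended.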
A simple ePub style sheet
Start with minimal styling and add to it as needed. I would consider minimal styling to include h1 and h2 headers for front or back matter and for chapters, paragraph formatting, blockquotes, and images. In addition, you might need special header or paragraph formatting on the main title page, an HTML table of contents if there is a hierarchical arrangement, and scene change separator spacing (blank line equivalents, special characters).
A minimalist style sheet might look something like this:
/* Headers */
h1, h2 {text-indent:0; font-weight:bold;}
h1.part {margin-top:30%; text-align:right; margin-right:20px; font-style:italic;}
h2 {margin-top:10%; text-align:center; page-break-after:avoid;}
h2.chap {margin-top:25%; text-align:right; margin-right:20px; font-style:italic;}
/* Elements */
p {text-indent:10%; margin:0;}
img {border:none;}
img#coverimg {max-width:100%;}
/* Classes */
.attrib {text-align:right; text-indent:0;}
.blank1 {margin-top:15px;}
.blank2 {margin-top:30px;}
.blank1c {margin-top:15px; text-indent:0; text-align:center;}
.blank2c {margin-top:30px; text-indent:0; text-align:center;}
.center {text-align:center; text-indent:0;}
.chapfirst {margin-top:10%;}
.ded {margin-top:20%; font-style:italic; text-align:center; text-indent:0;}
.italic {font-style:italic;}
.noind {text-indent:0;}
.pagebreak {page-break-before:always;}
/* Table of Contents styles */
.tcfirst {margin-top:30px; text-indent:5%}
.tc {margin-top:5px; text-indent:5%}
.tcsub {margin-top:5px; text-indent:10%}
Do experiment with percentages for margins and use them where possible. You don’t have to add a margin-right to right-aligned headers unless they look cramped to you, and I’d try a value of 1% or 2% instead of 20px and evaluate the results. The same goes for the blank “lines” used as separators. The reason I used pixels for blank1 and blank2 is that using ems with very large text sizes can create too much space when all you want is a visual cue or separator. It’s your call.
You will probably have other formatting situations that require additional styles. The main thing is to keep styles in the style sheet instead of scattered inline and remember that when you use your ePub as the basis of a Kindle book, an element can have only one class. In practice you may end up with a couple of very specific styles to accommodate Kindle, such as .bigBoldItalicNoIndent! If you’re a web designer, you’ll cringe, but the main thing is to make the book look good and keep the code as uncluttered as possible. If you can do that, maintenance is easier and file size will be reasonable.
Develop your style sheet as an internal one inside your (X)HTML file. Then look at the book file in your browser to check results. You won’t have page breaks, of course, but it’s a good way to see the effect of styles immediately. You can use a separate style.css file if you wish or wait until you open the html file in Sigil, and add a style sheet as your first task. More about that in the next step.
After removing the cruft, take a long break and proceed to Word to ePub – 4: XHTML to ePub with Sigil
D M Blank says
This is very useful, so thank you. Most of it I had already figured out through trial and error with filtered Word html, but I am having a problem that I hope you might know something about. I have simplified the styles and created and linked an external stylesheet. Things display fine in Sigil, but when I go to test the epub in Adobe Digital Editions and on Nook, the entire stylesheet is ignored. It feels like Word embedded some stylesheet that cannot be seen/opened. The only way I have worked out to override this is to copy the entire stylesheet code into the header of each file of the epub. Not very efficient, especially if anything changes at final editing.
Any ideas?
Thank you!
Araby Greene says
Thanks for the kind words! I also use Adobe Digital Editions for quick proofing because it’s handy.
There are several things to check. It sounds like the link to your stylesheet got lost when you added new pages to your document. The link needs to be in the head tag of each page. To ensure that it is, start off with a link to the stylesheet in the original imported file before you start splitting the document into separate files at page breaks.
That is, splitting the document retains the title tag and a link to the stylesheet in the head tag of new pages. Creating a new blank page does not preserve the title tag or stylesheet link.
When I start working on a document in Sigil and have finished preliminary cleanup of styles, I add a blank style sheet in the Styles folder, copy and paste my inline style sheet into it (minus the opening and closing <style> tags), and then create a link to it in the XHTML source. The link looks like this:
<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css" />
To fix your problem, it will be faster to use Sigil’s built-in option to add style sheet links to your XHTML files than to go through and paste the missing link in each file (if that’s the problem). To do that, select all the text files that need a stylesheet link, then right-click and select “Link Stylesheets.” Select the style sheet file(s) and click OK.
It’s also possible that there are some inline styles leftover from Word in the source code. To find them, switch to source view in Sigil and search All HTML files for: style=
Word is notorious for adding crazy span tags with unneeded inline styles, so I wouldn’t be surprised if some remain even though you’ve made an effort to clean up after Word.
Sigil also has a helpful Tool to delete unused styles in your stylesheet, and Reports, which will list styles in use in your document. Running the ePub validator can also find stray Word markup that is invalid in XHTML.
Hope this helps.
Helen says
Waiting for part 4 🙂
Araby Greene says
Me too. I was interrupted by several websites and two books from hell. I’m awash with guilt. I decided to not take on any more freelance projects, maybe permanently. Right after seeing the light of opportunity to work on my own stuff, I got a call from my local library friends group. Their webmaster passed away and nobody volunteered…
The frosting on the cake is that the latest version of Kindlegen made some changes that affect Mobi conversions. Notably, margin-bottom, which was previously forbidden, is now preferred over margin-top, which appears to be ignored on divs. Blockquotes remain perverse in Mobi, but work in KF8.
Thanks for your comment. I needed the nudge.
Matt says
Thank you so much for the helpful article. I have a problem I think you would know a solution to. In the body of my book I just have one style: 3% indent. I use an inline style for this one simple thing, but Sigil puts the CSS style in the head. I change it back and save it, but if you open the epub again it does it again. Why do they not want you to use inline styles? How can you get it not to change your epub?
I would really appreciate some help with this even if it is just an off the top of your head thought. Thanks for reading this.
Araby Greene says
You can change Sigil’s cleaning behavior by opening Edit/Preferences (F5). Select Clean Source and choose Pretty Print Tidy instead of HTML Tidy. If that doesn’t work for you, clear the checkboxes for Open and Save under “Automatically Clean and Format HTML Source Code.”
I like the automatic clean and format options because they prevent ePub validation errors later. Nevertheless, I prefer Pretty Print Tidy to HTML Tidy because it’s less aggressive. I avoid using inline CSS because it’s deprecated in XHTML 1.1, it’s harder to maintain than a style sheet, and it adds clutter to the code. On the other hand, an inline style might be a workaround for Kindlegen when multiple classes are needed to style a special section, or useful for unique one-off formatting. But if several or all paragraphs need an inline style for text-indent, it makes sense to use a style sheet because you can edit the style instead of the document to make global changes.
In the case of paragraph indents, defining a default style in a style sheet lets you edit the document more quickly, since you automatically get a <p> tag that’s consistently formatted. Creating a new style sheet in Sigil is very simple. Just right-click and select Add blank stylesheet. To give all <p> tags a 3% indent, add the style: p {text-indent: 3%;}