Saturday, 25 June 2016

A Closer Look at .docx Files

The default file type for Word documents has been .docx since Word 2007.  It's part of Microsoft's Office Open XML family, which also contains .xlsx (Excel spreadsheets) and .pptx (PowerPoint presentations).

Something that comes as a surprise to many people is that the .docx file type - along with the other file types in the Office Open XML family - is a ZIP-compressed file type that can be opened with a file archiver like 7-Zip.  (If you have problems opening the file, change the .docx extension to .zip, e.g. example.docx becomes example.zip.)
 
If you do this you'll see that a .docx is actually a wrapper for a set of XML files, mostly with the .xml extension, plus a few .rels files (.rels are XML relationship files that tell a program how the .xml files relate to each other).
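
If you'd rather not use an archiver, the contents can also be listed with a few lines of Python. This is only a minimal sketch using the standard zipfile module: the function name list_parts and the tiny in-memory sample package are made up for illustration, and a real .docx contains more parts than the sample does.

```python
import io
import zipfile


def list_parts(docx):
    """Return the names of all entries inside a .docx package.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        return package.namelist()


# Hypothetical stand-in package, built in memory so the sketch runs
# without a real Word file on disk.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("_rels/.rels", "<Relationships/>")
    package.writestr("word/document.xml", "<document/>")

for name in list_parts(sample):
    print(name)
```

To try it on a real document, pass a path instead, for example list_parts("example.docx"). Note that names inside the archive use forward slashes (word/document.xml) regardless of how an archiver displays them.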

 
So a .docx file isn't one file; it's a whole bunch of files that Microsoft Word parses into a coherent whole for the user to work on. Word is essentially a GUI front end for XML.

The correct technical term for a Word document is a package (the .docx file), which is composed of parts (the .xml files) that are connected using relationships (the .rels files).
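
To make the package/part/relationship idea concrete, here is a rough sketch that reads one .rels part and prints what it points to. The function name read_relationships and the sample package are assumptions made for this example; the namespace and the Id, Type and Target attributes are the ones used in relationship files.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the relationship (.rels) parts of a package.
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def read_relationships(docx, rels_part="word/_rels/document.xml.rels"):
    """Return (Id, Type, Target) tuples from one .rels part."""
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read(rels_part))
    return [
        (rel.get("Id"), rel.get("Type"), rel.get("Target"))
        for rel in root.iter(RELS_NS + "Relationship")
    ]


# Hypothetical sample: a package whose document part relates to styles.xml.
sample_rels = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" '
    'Target="styles.xml"/>'
    "</Relationships>"
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as package:
    package.writestr("word/_rels/document.xml.rels", sample_rels)

for rel_id, rel_type, target in read_relationships(sample):
    print(rel_id, target, rel_type)
```

Targets are normally relative to the part that owns the .rels file, so styles.xml here means word/styles.xml.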

The list of files and folders in a new, blank Word 2016/365 document is as follows:
  • \docProps\app.xml - Metadata from the application.
  • \docProps\core.xml - Metadata from the document.
  • \word\theme\theme1.xml - The default theme for the document.
  • \word\_rels\document.xml.rels - The relationship file that contains the relations between document.xml and other parts of the package.
  • \word\document.xml - The actual content in your document.
  • \word\fontTable.xml - A list of all the fonts used in the document.
  • \word\settings.xml - Various document settings, including defaults and security settings.
  • \word\styles.xml - The styles that are available in the document (not the styles that have been used, all the styles that are available).
  • \word\webSettings.xml - The web-specific settings used by the document.
  • \_rels\.rels - The package-level relationships for the document.
  • [Content_Types].xml - The various types of parts that are used in the package.
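
Since document.xml holds the actual content, its text can be pulled out without opening Word at all. The sketch below is a simplified illustration, not a full parser: read_paragraphs is a made-up name, the sample document is hypothetical, and it only joins the text runs in each paragraph, ignoring tabs, line breaks, fields and text boxes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace (the "w:" prefix in document.xml).
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_paragraphs(docx):
    """Return the plain text of each <w:p> paragraph in word/document.xml."""
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        text = "".join(t.text or "" for t in paragraph.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs


# Hypothetical sample: one paragraph split over two runs, then an empty one.
sample_document = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Hello</w:t></w:r>"
    '<w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>'
    "<w:p/>"
    "</w:body>"
    "</w:document>"
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as package:
    package.writestr("word/document.xml", sample_document)

for text in read_paragraphs(sample):
    print(repr(text))
```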

If you add things to your document then other folders and files will be created by Word. 
For example, embedded media such as images will be in the \word\media folder and embedded files such as spreadsheets will be in the \word\embeddings folder.  

An important issue to note is that when you embed an image and then edit it, the original image is stored in the \word\media folder, and the edits are stored in \word\document.xml, in the picture's markup (the <w:drawing> element in current versions of Word, or <w:pict> for legacy VML pictures).  If you've cropped or otherwise obfuscated private information using Word's inbuilt editing tools then anyone who opens the .docx package using something like 7-Zip can still see the original picture.
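
A quick way to check what is really stored is to list everything under word/media before you share a document. This is a small sketch under the same assumptions as before: list_media is a hypothetical helper name and the sample package contains placeholder bytes rather than a real image.

```python
import io
import zipfile


def list_media(docx):
    """Return (name, size in bytes) for every entry under word/media/."""
    with zipfile.ZipFile(docx) as package:
        return [
            (info.filename, info.file_size)
            for info in package.infolist()
            if info.filename.startswith("word/media/")
        ]


# Hypothetical sample package with one placeholder "image".
payload = b"not a real image"
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as package:
    package.writestr("word/document.xml", "<document/>")
    package.writestr("word/media/image1.png", payload)

for name, size in list_media(sample):
    print(name, size)
```

Any file listed here can be extracted in full with zipfile's extract method, whatever cropping has been applied inside Word.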

There is a lot more that can be said about the Open XML format, so rather than reinvent the wheel here are some links to explore:
MSDN contains plenty of useful OOXML information for users and developers, and the specifications and XSD information linked to above will tell you everything you could ever want to know about how the .docx structure works.

Finally, if you're thinking that this is very interesting but not particularly useful for writing topic-based documentation then you should know that Yes, you can write DITA in Word!