We currently face the challenge of converting over a thousand Word 2003 documents into SharePoint publishing pages. Here is our approach:
Step 1: Run the Microsoft document conversion tool to convert each Word document (.doc) to .docx
You can download it for free from Microsoft at: http://www.microsoft.com/downloads/details.aspx?familyid=13580cd7-a8bc-40ef-8281-dd2c325a5a81&displaylang=en
Step 2: Call the out-of-the-box document converter to convert each .docx file to .aspx
With that many files to convert, we have to write code to do the job. Use the Add method of the Microsoft.SharePoint.Publishing.PublishingPageCollection class to add a new publishing page:
public PublishingPage Add (
    string newPageName,
    SPFile fileToConvert,
    Guid transformerId,
    PageConversionPriority priority
)
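As a rough illustration of how this overload could be called in a loop, here is a minimal sketch. The class name BatchConverter, the method ConvertFolder and the idea of reading the .docx files from a single SPFolder are my own assumptions for the example; the transformer ID and priority are passed in by the caller because their values depend on your farm configuration. Only exceptions thrown by the call itself are recorded, and I am assuming (not asserting) that a failed conversion would surface that way.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.SharePoint;
using Microsoft.SharePoint.Publishing;

// Hypothetical helper, not part of the SharePoint API.
public static class BatchConverter
{
    // Queues every .docx file in sourceFolder for conversion into a publishing page
    // and returns a list of "file name: message" entries for calls that threw.
    public static List<string> ConvertFolder(
        SPWeb web,
        SPFolder sourceFolder,
        Guid transformerId,               // ID of the docx-to-page converter configured on the farm
        PageConversionPriority priority)
    {
        List<string> failures = new List<string>();
        PublishingWeb publishingWeb = PublishingWeb.GetPublishingWeb(web);
        PublishingPageCollection pages = publishingWeb.GetPublishingPages();

        // Copy the files to a list first so the loop does not depend on a live collection.
        List<SPFile> documents = new List<SPFile>();
        foreach (SPFile file in sourceFolder.Files)
        {
            if (file.Name.EndsWith(".docx", StringComparison.OrdinalIgnoreCase))
            {
                documents.Add(file);
            }
        }

        foreach (SPFile document in documents)
        {
            // 100.docx becomes 100.aspx
            string pageName = Path.GetFileNameWithoutExtension(document.Name) + ".aspx";
            try
            {
                pages.Add(pageName, document, transformerId, priority);
            }
            catch (Exception ex)
            {
                // Keep going so that one bad document does not stop the whole batch.
                failures.Add(document.Name + ": " + ex.Message);
            }
        }

        return failures;
    }
}
```

Collecting the failures in a list at least tells you which documents to inspect by hand afterwards.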
This approach works; however, we now have two issues:
1. About 10% of the documents do not convert. The error message reports only an internal error, which does not help us identify the problem. After spending a lot of time on the Word documents, we think the error is related to their formatting, such as bullets and section breaks, but at this time we cannot be certain exactly what the problem is.
2. Some of the Word formatting is lost, or styles are changed, after conversion.
To work around this, I have another idea that bypasses the converter. For example, to convert 100.doc to 100.aspx, we follow these steps:
1. Create an empty 100.aspx publishing page first. Of course, we know the content type and the field in which to save the HTML content.
2. Convert 100.doc to 100.htm using the default Office behavior. You could do this by writing a piece of code, or just open 100.doc and save it as an HTML file.
3. Programmatically paste the HTML text into the content field of the 100.aspx publishing page.
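Steps 1 and 3 above could look roughly like the following sketch. It assumes the HTML file from step 2 already exists on disk. The class name HtmlPageImporter, the method CreatePageFromHtml and its parameters are hypothetical names for this example; the page layout and the name of the content field are supplied by the caller, since they depend on the content type you chose.

```csharp
using System.IO;
using Microsoft.SharePoint;
using Microsoft.SharePoint.Publishing;

// Hypothetical helper, not part of the SharePoint API.
public static class HtmlPageImporter
{
    public static PublishingPage CreatePageFromHtml(
        SPWeb web,
        PageLayout pageLayout,      // layout belonging to the content type you want to use
        string pageName,            // e.g. "100.aspx"
        string htmlPath,            // e.g. the path to 100.htm saved by Word
        string contentFieldName)    // name of the field that holds the page's HTML content
    {
        // Step 1: create the empty publishing page.
        PublishingWeb publishingWeb = PublishingWeb.GetPublishingWeb(web);
        PublishingPage page = publishingWeb.GetPublishingPages().Add(pageName, pageLayout);

        // Step 3: read the saved HTML and store it in the content field.
        // Assumption: the file is UTF-8 or carries a byte order mark. Use the
        // File.ReadAllText overload that takes an Encoding if Word saved it differently.
        string html = File.ReadAllText(htmlPath);
        page.ListItem[contentFieldName] = html;
        page.Update();

        // The new page is checked out to the current user; check it in so others can see it.
        page.CheckIn("Imported from " + Path.GetFileName(htmlPath));
        return page;
    }
}
```

Depending on how the Pages library is configured, the page may still need to be published or approved after check-in.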
Note: the HTML saved by Word contains an embedded style sheet (CSS) within the page. Some of its classes might conflict with your SharePoint master page; make sure you delete those classes. However, keeping the CSS in the HTML and pasting it into the content field is not best practice. I would rather move those styles into the CSS file that the master page uses. If you do this, make sure you delete every style section from the HTML before pasting it into the content field.
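One possible way to delete the style sections is a small regular-expression pass, sketched below. WordHtmlCleaner and Clean are hypothetical names. The sketch removes every style element and keeps only what sits inside the body element; it does not touch Word's class attributes or conditional comments, and a real HTML parser would be more robust on unusual input.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for cleaning HTML saved by Word before it is pasted into a page field.
public static class WordHtmlCleaner
{
    // Matches a whole <style ...> ... </style> element, across line breaks.
    private static readonly Regex StyleBlock = new Regex(
        @"<style\b[^>]*>.*?</style>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Captures everything between <body ...> and </body>.
    private static readonly Regex BodyContent = new Regex(
        @"<body\b[^>]*>(.*?)</body>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Clean(string wordHtml)
    {
        // Drop all embedded style sheets first.
        string withoutStyles = StyleBlock.Replace(wordHtml, string.Empty);

        // Keep only the body content; fall back to the whole text if there is no body element.
        Match body = BodyContent.Match(withoutStyles);
        return body.Success ? body.Groups[1].Value.Trim() : withoutStyles.Trim();
    }
}
```

The string returned by Clean is what you would store in the content field, once the styles themselves have been copied into the master page's CSS file.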
It is faster, and all the Word formatting is kept, even the bullets and lines in the Word document. However, I am really not sure whether this is a good idea, but at least it gives you an alternative way of converting Word documents into SharePoint pages.
I will post the entire solution when the code is complete.