Converting a WordPress blog to Hugo
As I’ve discussed in other places, I’ve tried quite a few times to have a blog. That sounds pretty simple, but here are some notes on converting one of those blogs, a hosted WordPress blog, to Hugo.
If you do not care to read (or about) all the words and explanations, feel free to skip to the SHORT VERSION
BACKGROUND
Inevitably I will be excited about a blog and sharing information for a bit of time. But then I’ll get sidetracked or focused on the wrong thing (such as trying to make it look awesome rather than providing awesome content). And it will eventually get abandoned. In my most recent foray into blogging, I signed up for a hosted WordPress blog.
There is some decent (probably not great) content there. When I got an email reminding me about it and telling me I’d be charged again soon, I decided that I should just let it wither away to wherever blogs go to die. But I want to keep that content under the guise of retaining the information (but the reality of being a packrat).
So I need to convert the content that is currently stored in a WordPress database into markdown so I can “keep” it here. [note to future self: explain the reasoning behind this shift to static pages vs. WordPress]
CHALLENGE #1
There is a lot of great information online about just this type of conversion
using pelican-import. Now I am not currently using pelican but found that this
would provide a path to where I wanted to be. I installed everything required
per the great pelican documentation but every attempt to run the very well
documented command failed with something like:
--normalize has been removed. Normalization is now automatic.
--parse-raw/-R has been removed. Use +raw_html or +raw_tex extension.
Try pandoc --help for more information.
Please, check your Pandoc installation.
It would create a single .html file and bomb, so I jumped to the conclusion
that the file I was attempting to import was corrupt or not what the tool
expected. Every page I found kept casually referring to an XML export of a
WordPress database but provided no information about how to do that or how to
generate that file. I am going to continue that trend and only provide a link
that seems relevant: https://codex.wordpress.org/Tools_Export_Screen
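Since my first suspicion was a bad export file, here is a rough sanity check that would have ruled that out quickly. It is only a sketch: looks_like_wxr is a name I made up, and it assumes a WordPress export (WXR) is an RSS document that carries a wp:wxr_version element, which is what the exports I have seen look like.

```shell
#!/usr/bin/env bash
# looks_like_wxr is a hypothetical helper, not part of WordPress or pelican.
# It succeeds when the file has the two markers I expect in a WordPress
# export: an <rss> root element and a <wp:wxr_version> element.
looks_like_wxr() {
  local file="$1"
  grep -q '<rss' "$file" && grep -q '<wp:wxr_version>' "$file"
}
```

Usage would be something like looks_like_wxr <EXPORTED_WORDPRESS_XML_FILE> && echo "probably fine". If it fails, the file is more likely a saved web page or a database dump than an export.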
Full disclosure: this confusion was probably due to my ignorance around WordPress and the guilt I found around paying BlueHost for it but never learning more about it…
SOLUTION #1
As it turns out, at the time of this writing, there is a GitHub issue about
this. I followed the guidance in that issue and on this page, which required
that I download the pelican code from GitHub, modify the
pelican/tools/pelican_import.py file, and then install using
python setup.py install from the directory where I downloaded the code.
Full disclosure: if these “details” seem fuzzy it is because I was really floundering around when I got this working and did not take the type of notes I’d prefer to be able to provide. Please just accept my apology and read the information in the link above…it will make all of your wildest dreams come true.
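The error text comes from pandoc itself, which no longer accepts the two flags. As far as I know they were dropped in the pandoc 2.0 release, so a quick version check tells you whether you are in the same boat. This is a sketch only: pandoc_major is a made-up helper, and it assumes the first line of pandoc --version looks like "pandoc 2.5".

```shell
#!/usr/bin/env bash
# pandoc_major is a hypothetical helper: given the first line of
# `pandoc --version` (assumed to look like "pandoc 2.5"), print the
# major version number.
pandoc_major() {
  local first_line="$1"
  # Strip the leading "pandoc ", then everything from the first dot on.
  local version="${first_line#pandoc }"
  echo "${version%%.*}"
}

# Only report if pandoc is actually installed.
if command -v pandoc >/dev/null 2>&1; then
  major="$(pandoc_major "$(pandoc --version | head -n 1)")"
  case "$major" in
    ''|*[!0-9]*)
      echo "could not parse the pandoc version" >&2 ;;
    *)
      if [ "$major" -ge 2 ]; then
        echo "pandoc $major.x: expect the --normalize / --parse-raw error"
      else
        echo "pandoc $major.x: the old flags should still be accepted"
      fi ;;
  esac
fi
```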
Now you should be able to run pelican-import similar to the following (I expect
to do some post-processing so I ran a pretty generic command):
pelican-import -m markdown --wpfile -o <OUTPUT_DIR> <EXPORTED_WORDPRESS_XML_FILE>
CHALLENGE #2
The pelican-import utility gets me a lot closer to where I want to be.
Unfortunately, the heading is not formatted in the way that I would like it to
be. This is also noted by Matty at Prefetch Technologies. These observations
and the way he addressed them were insightful, but I modified it a bit for my
use. The majority of the work is in getting the “front matter” correct.
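Before touching the front matter it helps to see which header keys the converted files actually contain. A minimal sketch, assuming each converted file starts with Key: value lines followed by a blank line (that is what my files look like); count_header_keys is a name I invented for this.

```shell
#!/usr/bin/env bash
# count_header_keys is a hypothetical helper. For every .md file in a
# directory it reads the header (all lines before the first blank line)
# and counts how many files-worth of each "Key:" it finds.
count_header_keys() {
  local dir="$1"
  local file
  for file in "$dir"/*.md; do
    [ -e "$file" ] || continue
    # Stop at the first blank line, then keep only the key name.
    awk '/^$/ { exit } { print }' "$file" \
      | sed -n 's/^\([A-Za-z]*\):.*/\1/p'
  done | sort | uniq -c
}
```

Running count_header_keys <OUTPUT_DIR> prints one line per key with a count in front. Any key whose count is lower than the number of files is one the processing has to treat as optional.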
SOLUTION #2
Unfortunately, what was produced from WordPress is not consistent (…or maybe the reality is that what I produced in WordPress is not consistent…whatevs). For instance, some files have tags and others don’t…I have to account for that when I process them. I have decided that all of these converted posts will be tagged with “converted”, “archive”, and the name of the blog they came from. So here is roughly what my processing command looks like (and see the explanation below the code snippet):
# GNU sed syntax (one-line a/c text, \n in the replacement, \s)
# On the Status line, p prints the default Tags line when no Tags line was seen
sed -e '1s/^/---\n/' \
 -e '/Title:.*/a Subtitle: (converted post)' \
 -e 's/^Date: \(.*\) \(.*\)/Date: \1/' \
 -e '/^Slug/d' \
 -e 's/^Category: \(.*\)/Categories: ["\1"]/' \
 -e '/^Tags:/{h;s/,\s/","/g;s/:\s/: ["/g;s/$/","converted","archive","ilearnedthisthehardway"]/g};/^Status:.*/{x;/^$/{s//Tags: ["converted","archive","ilearnedthisthehardway"]/;p};x}' \
 -e '/^Status: published/c---' \
 ilearnedthisthehardway.posts/the-yard.md
Here is what that mess is doing, line by line:
- Insert a --- before the first line
- Append a Subtitle: (converted post) line after each Title: line
- Remove the garbage after the date, leaving only YYYY-MM-DD
- Delete the Slug: line since it isn’t used (I could have done this at time of import with --disable-slugs, but I didn’t…)
- If Category is specified in the original file (it isn’t always), change it to Categories, which Hugo will recognize
- This is a tricky one but if Tags are specified, I want to keep them. Unfortunately they are not always specified and when they are, they are in the wrong format. So, if they exist, change the format to ["tag1","tag2"], which Hugo understands, and append the tags I want to be included. If there are no tags specified initially, add the ones I want…see this page for a great explanation.
- Change the Status line (since I don’t use it) to --- to end my front matter
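To make the Tags handling less mysterious, here is a sketch that runs the same kind of command on two made-up posts, one with a Tags line and one without. The assumptions: GNU sed, header lines in the order my converted files use (Tags before Status), and a wrapper name, convert_front_matter, that I invented. In this sketch the default Tags line is printed with p when the Status line is reached, because the c command that follows replaces the whole pattern space.

```shell
#!/usr/bin/env bash
# convert_front_matter is a hypothetical wrapper: it reads one converted
# post on stdin and writes the Hugo-style version to stdout (GNU sed only).
convert_front_matter() {
  sed -e '1s/^/---\n/' \
      -e '/Title:.*/a Subtitle: (converted post)' \
      -e 's/^Date: \(.*\) \(.*\)/Date: \1/' \
      -e '/^Slug/d' \
      -e 's/^Category: \(.*\)/Categories: ["\1"]/' \
      -e '/^Tags:/{h;s/,\s/","/g;s/:\s/: ["/g;s/$/","converted","archive","ilearnedthisthehardway"]/g}' \
      -e '/^Status:.*/{x;/^$/{s/^$/Tags: ["converted","archive","ilearnedthisthehardway"]/;p};x}' \
      -e '/^Status: published/c---'
}

# Made-up post that already has tags: they are kept and the extras appended.
printf 'Title: The Yard\nDate: 2017-05-01 10:30\nCategory: Home\nTags: grass, mowing\nSlug: the-yard\nStatus: published\n\nBody text.\n' \
  | convert_front_matter

echo
# Made-up post without tags: the hold space is still empty at the Status
# line, so only the default Tags line is emitted.
printf 'Title: Other\nDate: 2017-06-02 08:00\nSlug: other\nStatus: published\n\nText.\n' \
  | convert_front_matter
```

The hold space is what carries the "did I see a Tags line?" information: h stores the Tags line when there is one, and x at the Status line swaps it in so the empty case can be detected.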
NEXT STEPS
- I need to actually do this work…the post was largely to make note of the progress and commands I need to use; the work remains.
WHAT I LEARNED
- Linux is fun!
REFERENCE
- Exporting Wordpress Posts To Markdown
- Move from wordpress to static site with Jigsaw
- sed: how to replace line if found or append to end of file if not found?
SHORT VERSION
To avoid tl;dr enjoy this instead:
## 01 Fix Pelican and build myself (lost the notes here...sorry) to convert
## 02 Process the converted files
sed -e '1s/^/---\n/' \
-e '/Title:.*/a Subtitle: (converted post)' \
-e 's/^Date: \(.*\) \(.*\)/Date: \1/' \
-e '/^Slug/d' \
-e 's/^Category: \(.*\)/Categories: ["\1"]/' \
-e '/^Tags:/{h;s/,\s/","/g;s/:\s/: ["/g;s/$/","converted","archive","ilearnedthisthehardway"]/g};/^Status:.*/{x;/^$/{s//Tags: ["converted","archive","ilearnedthisthehardway"]/;p};x}' \
-e '/^Status: published/c---' \
ilearnedthisthehardway.posts/the-yard.md