Generating Table of Contents for Markdown File
Sooner or later, everyone ends up with a large markdown file written for documentation purposes. To give readers a better overview of what is inside such a file, it helps to include a Table of Contents (ToC) at the top.
If you manage multiple git repositories with a README.md in each of them, writing a ToC by hand is a time-consuming job. So I researched which tools people use to create a ToC and came across this StackOverflow question, which contained a lot of scripts that can help with this task.
There were scripts and programs written in every popular language, but nobody answered with a bash script that can generate it.
So I thought…
Why not write a simple bash script with no dependencies that can correctly generate a ToC from any markdown file? It should not be a big deal, and bash could handle it very well.
And so I created a ToC generator in bash and answered the question on StackOverflow. In this blog post I will describe my development approach and the obstacles I hit along the way.
Headings
Markdown engines are used to produce an HTML page. This page contains special navigation links generated from the markdown headings, which are denoted by the # sign. For example, clicking on the link https://github.com/kubernetes/kubernetes/#to-start-developing-k8s will move your browser directly to the heading to-start-developing-k8s.
The headings also have levels:
- # is heading level 1
- ## is a subheading of the previous #
- The same goes for more deeply nested headings, up to six (######)
The main goal of my algorithm is therefore to get all lines starting with a # sign and save them in some kind of data structure. In this case an array is sufficient, because the order of lines in the ToC is the same as the order in which we iterate over the file.
Code Blocks
Second, I did not want to produce links from lines inside code blocks. Many languages use # to comment out a line and, as mentioned above, # in markdown indicates the start of a heading.
Code blocks start and end with three backticks (```). So the first part of my algorithm was:
- Go over the file line by line
- Ignore text from a line that starts a code block (```) until the code block ends
- Save all lines that start with #
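A minimal sketch of these three steps in bash could look like the following. It is not the script from my repository: the function name collect_headings and the HEADINGS array are made up for illustration, and it assumes that a code block is always delimited by a line starting with three backticks.

```bash
#!/usr/bin/env bash
# Sketch only: collect_headings and HEADINGS are hypothetical names.

# Three backticks, written as \x60 escapes so that this sketch can itself
# live inside a fenced code block.
fence=$'\x60\x60\x60'

# Read markdown from stdin and store every heading line in the global
# array HEADINGS, skipping everything inside fenced code blocks.
collect_headings() {
  HEADINGS=()
  local line in_code_block=0
  while IFS= read -r line || [[ -n "$line" ]]; do
    # A line starting with the fence toggles the "inside code block" state.
    if [[ "$line" == "$fence"* ]]; then
      in_code_block=$(( 1 - in_code_block ))
      continue
    fi
    # Outside of code blocks, keep lines that start with '#'.
    if (( in_code_block == 0 )) && [[ "$line" == '#'* ]]; then
      HEADINGS+=("$line")
    fi
  done
}
```

It would be called with the file on stdin, for example collect_headings < README.md. A redirect is used instead of a pipe so that the array is filled in the current shell and not in a subshell.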
Output
In theory, this should be sufficient to produce a ToC. The script just outputs the markdown links formatted as a list, indented according to the level of each heading, and we are done.
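To make the idea concrete, here is a rough sketch of such a naive output step. The function name print_naive_toc and the two spaces of indentation per level are assumptions for illustration. Note that the anchor is simply the raw heading text, with no further processing.

```bash
#!/usr/bin/env bash
# Sketch only: print_naive_toc is a hypothetical helper.

# Print one list item per heading line passed as an argument.
# The nesting depth is derived from the number of leading '#' characters.
print_naive_toc() {
  local heading hashes level text indent
  for heading in "$@"; do
    hashes="${heading%%[!#]*}"          # leading run of '#'
    level=${#hashes}
    (( level > 0 )) || continue         # not a heading line, skip it
    text="${heading#"$hashes"}"         # drop the '#' prefix
    text="${text#"${text%%[! ]*}"}"     # trim leading spaces
    # Two spaces of indentation per nesting level (an arbitrary choice).
    printf -v indent '%*s' $(( (level - 1) * 2 )) ''
    # Naive anchor: the heading text exactly as written.
    printf '%s- [%s](#%s)\n' "$indent" "$text" "$text"
  done
}

print_naive_toc '# Contribution' '## Setup'
```

For the two headings above this prints "- [Contribution](#Contribution)" followed by an indented "- [Setup](#Setup)".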
But it will not be that simple!
Markdown Links
It's not that simple because of how anchor links look after they are rendered from a markdown file into an HTML web page.
As you may know, a link anchor in an HTML page can point to an element within the same HTML document. This is why clicking on some links moves your browser to the correct place inside the same page. The part of the URL that starts with # is called the fragment identifier.
For example, the anchor https://example.com/blog/my-blog#section-one has the fragment identifier #section-one, and after you click it your browser moves to the HTML element whose ID is set with id=section-one.
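As a small illustration, the fragment can be cut out of a URL with plain bash parameter expansion. The helper name fragment_of is made up for this sketch.

```bash
#!/usr/bin/env bash
# Sketch only: fragment_of is a hypothetical helper.

# Print the fragment identifier of a URL (the part after the first '#'),
# or an empty line if the URL has no fragment.
fragment_of() {
  local url="$1"
  case "$url" in
    *'#'*) printf '%s\n' "${url#*'#'}" ;;  # strip everything up to the first '#'
    *)     printf '\n' ;;
  esac
}

fragment_of 'https://example.com/blog/my-blog#section-one'   # prints section-one
```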
Here is an example of how the ToC link [Contribution](#contribution) is rendered in GitHub:
<a href="#contribution">Contribution</a>
And this is an example of how the heading is rendered in GitHub (note that it also links to itself):
<a id="user-content-contribution" class="anchor" href="#contribution">
So now you may ask why the heck the element ID of the heading is prefixed with user-content-. This is specific to sites such as GitHub and GitLab. They prefix the ID like this probably to prevent users from creating HTML elements with the same IDs as other elements in the GitHub web pages.
Side note: Gitea, for example, has already reverted this feature, and GitLab already has an open issue about it.
To get back on track, our goal in this section was to output a markdown link that contains the same ID as the heading has after rendering to HTML. This means that if the script produces a line like this:
- [Contribution](#contribution)
it will render the text Contribution linking to the heading with id=contribution.
Generating the Header ID
After the previous theoretical section, where I hope you learned something new, we can turn to why this solution is still not enough and will not produce a working ToC.
The markdown engine makes several alterations to the contents of a heading before it produces the ID that serves as the anchor. In my case, I reverse engineered them by trying various combinations of special characters, spaces, letter case and so on, and rendering them in a test GitLab repository.
My reverse-engineered set of rules was the following:
- Remove the link part (e.g. [link](https://link.com)) from markdown links if found in the heading;
- Remove all characters besides alphanumeric ones, dashes, underscores and spaces (underscores are actually preserved in the links);
- Remove the first space;
- Substitute spaces with dashes;
- Lowercase everything;
- Then, if there are multiple dashes next to each other, convert them to a single one;
- And at the end, output everything in markdown link format.
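The rules above can be sketched in pure bash roughly as follows. This is an illustration under a few assumptions, not the code from markdown-toc.sh: the names heading_to_anchor and toc_entry are hypothetical, the lowercase expansion needs bash 4 or newer, and the leading # markers and spaces are stripped up front instead of as a separate "remove the first space" step.

```bash
#!/usr/bin/env bash
# Sketch only: heading_to_anchor and toc_entry are hypothetical names.

# Turn a heading line such as '## To start developing K8s' into an anchor ID.
heading_to_anchor() {
  local text="$1"
  # Matches one markdown link: prefix, [link text], (url), suffix.
  local link_re='^(.*)\[([^]]*)\]\([^)]*\)(.*)$'

  # Strip the leading '#' markers and the spaces that follow them.
  text="${text#"${text%%[!#]*}"}"
  text="${text#"${text%%[! ]*}"}"

  # Replace every markdown link [text](url) with just its text.
  while [[ "$text" =~ $link_re ]]; do
    text="${BASH_REMATCH[1]}${BASH_REMATCH[2]}${BASH_REMATCH[3]}"
  done

  # Keep only alphanumerics, underscores, spaces and dashes.
  text="${text//[^[:alnum:]_ -]/}"
  # Substitute spaces with dashes.
  text="${text// /-}"
  # Lowercase everything (requires bash 4+).
  text="${text,,}"
  # Collapse runs of dashes into a single dash.
  while [[ "$text" == *--* ]]; do
    text="${text//--/-}"
  done

  printf '%s\n' "$text"
}

# Print the ToC list item for one heading line (indentation left out).
toc_entry() {
  local title="$1"
  title="${title#"${title%%[!#]*}"}"
  title="${title#"${title%%[! ]*}"}"
  printf -- '- [%s](#%s)\n' "$title" "$(heading_to_anchor "$1")"
}

heading_to_anchor '## To start developing K8s'   # prints to-start-developing-k8s
toc_entry '# Kubernetes (K8s)'                   # prints - [Kubernetes (K8s)](#kubernetes-k8s)
```

The printf in toc_entry uses -- because its format string starts with a dash, which bash would otherwise try to parse as an option.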
The code in my script could probably still be simplified, but as you know, sometimes you just don't have the energy to make everything as awesome as it could be.
Why the contributors did not add any comments to the part of the code with complicated regex functions and patterns is beyond me.
Explanation:
- text.downcase: Lowercases the heading;
- NON_WORD_RE: Matches the inverse of a space or tab (\t), a dash (-) or a word character. With it we can delete all special characters;
- MARKDOWN_LINK_TEXT: Picks only link_text from the input and uses it in the header ID. This means that if there is a link in the heading, the text part is preserved in the ID;
- result.tr!: Substitutes all spaces and tabs with -;
- .freeze is just used to denote a constant.
The code is not very readable, but that applies to almost all complicated regexps.
Last Words
I hope you learned something here. If your markdown files lack a table of contents, head over to my repository and generate one.
Open https://github.com/Lirt/markdown-toc-bash and clone the repository, or just wget the script markdown-toc.sh, and run:
./markdown-toc.sh <PATH_TO_README_FILE>
It will output a ToC like this, which you can grab and put into your markdown file:
## Table of Contents
- [Kubernetes (K8s)](#kubernetes-k8s)
- [To start using K8s](#to-start-using-k8s)
- [To start developing K8s](#to-start-developing-k8s)
- [You have a working [Go environment].](#you-have-a-working-go-environment)
- [You have a working [Docker environment].](#you-have-a-working-docker-environment)
- [Support](#support)