magical shell utilities: tsort
Written: 2020-10-30 to 2020-10-31Graph theory comes up a lot more often than you think. Whenever you
run a search, download packages, or do anything on the Internet really,
you’re reaping its benefits without even realizing. You might be
familiar with sort
if you’ve done any work in the shell
beyond rm -rf /*
. It’s useful for working with huge
unsorted files, and it’s pretty much an essential tool for any serious
scripting. There’s another program with similar intentions, but
completely different usecases - tsort
. It’s a relative
newcomer to the POSIX.1
standard that
was only canonized in 2017, although it’s been around since 1979.
tsort
takes any number of space-separated pairs that
represent directed edges in a graph, and outputs each node in
topologically sorted order. Its original purpose was for use in
ld
, the UNIX standard linker. Back in the days when
tar
actually referred to tape archives, the linker could
only do a single pass over a series of object files, in order. With this
processing format, problems would arise when the linker tried to figure
out symbols or functions that were invoked before they were defined. The
solution for this? Represent a program as a directed graph, and generate
a topological sorting of symbol references first.
What’s a topological sort?
If you’re unfamiliar with graph theory, here’s a short refresher. A directed graph is a collection of nodes with arrows between each other, pointing towards another node.
I made these figures in Snapchat
,
bear with me.
If we have two nodes like so,
We can turn it into a directed graph by drawing an arrow that points from A to B. You can look at these arrows as family trees, flow-charts, morphisms, or sloppy artwork.
A special subset of directed graphs are directed acyclic graphs,
where you can travel along every path emanating from a node and
eventually run out of paths to take. This turns out to be extremely
useful in a ton of domains. Whenever you compile a project, you’re
creating a dependency graph of modules. Your Main
module might depend on Lib
, which in turns depends on Bloat, Parse
, and is-even
. Any
dependent relationship between modules forms a directed edge in the
overall graph representing the set of all your modules.
When we have cycles in our graph, however, that’s when issues arise.
If you don’t keep track of which nodes you’ve been to, you’ll end up
going around in circles as you walk the graph. You’ll go from A
to B
, see
that there’s a path from B
to A
, and keep going. This is a trivial example
obviously. You might be thinking, “I’m not that stupid, I’ll just keep
track of which nodes I’ve been to.” Sure, but what if the graph was
bigger? How could you represent this graph as a 1-dimensional array?
Cycles are fine depending on what you’re trying to do. If you’re writing a game and you need to implement pathfinding around obstacles, you might rely on a breadth-first search or some version of A*. Generally, if you know where you want to go, cycles aren’t that big of a deal. Assuming you know which McDonald’s you wanna head to, you probably won’t get tripped up by cloverleaf interchanges on the highway. If you’re trying to map the highway however, this becomes an issue.
By the way, trees are always directed acylic graphs, with the special restriction that a node can only ever be pointed to by one node.
A topological sort is a transformation that’s only valid for graphs
without cycles. The idea is to permute every node in such a way that, if
a node A
points to a node B
, A
will
come before B
in the final sorted
permutation.
Let’s say we have a DAG that looks like this:
The nodes are approximately sized according to how many outgoing edges they have. If you wanted to walk this graph, visiting every single emoji, which one would you start at? You can only go in the direction that the arrows point.
Clearly, you would start at the emoji that doesn’t have any arrows
pointing toward it. From there, you can head to
explodey-head
, yeehaw
, and
kewl-dude
. explodey-head
points to
think
, and yeehaw
points to
balloon-face
. From balloon-face
, you head to
kewl-dude
once more, and you’re out of options. You can see
that even though some nodes can share directions, we still don’t have
any cycles. The order in which we walked the graph corresponds to its
topological sort - we always have to visit the originating node
of an edge before we can visit the destination node. For nodes
with the same “depth”, their relative order doesn’t really matter. It
doesn’t matter whether we visit yeehaw
or
explodey-head
first, they won’t take it personally.
This kind of traversal strategy is exactly what tsort
does. If we convert every edge to a pair of nodes, we can find a
traversal path in mere milliseconds. Note that the original permutation
of our pairs doesn’t matter.
$ cat emojiGraph.txt
explode think
cowboy laugh
ahegao explode
ahegao sunglasses
laugh sunglasses
explode cowboy
ahegao cowboy
$ tsort emojiGraph.txt
ahegao
explode
cowboy
think
laugh
sunglasses
Notice that think
comes after cowboy
, even
though they both have edges that are 2 degrees of separation away from
our origin. There’s a shorter, 1-degree path to cowboy
, so
he comes first.
We have our topological sorting now, a 1-dimensional representation of the DAG and a souvenir of our time reenacting of the Emoji Movie:
There’s a really interesting Leetcode problem that involves graph traversal in this way. The prompt goes something like this:
You have a list of n
courses you need to take, labeled
from 0
to n - 1
. Some courses have
prerequisites that need to be satisfied before you can take them. Given
n
, and a list of pairs [(a, b)]
, denoting that
b
is a prerequisite of a
, return the order in
which you need to take these courses. Notice that the ordering of edges
is backwards from our tsort
example.
So let’s go with a simple example. We’ll use part of my old university transcript.
= [ (HMB265, BIO120)
courses BCH311, BCH210)
, (HAJ453, HMB312)
, (HMB204, BIO130)
, (HAJ453, BIO220)
, (BCH210, CHM136)
, (PSL301, BIO130)
, (HMB312, HMB265)
, (HMB312, HMB204)
, (BIO220, BIO120)
, (HMB265, BIO130)
, ( ]
Where do we go with this? How the hell do I graduate if my course schedule is a complete mess? (In real life, there are credit requirements, corequisites and sometimes prerequisite requirements aren’t actually mandatory, but we’ll assume a simple dependency graph here.)
Let’s reshuffle this list a bit by hand now to make it a little more clear where we’re starting. Let’s try grouping together our courses by their incoming edges - in other words, assembling their total prerequisites.
= {
courseRequirements BIO120: [],
BIO130: [],
CHM136: [],
BIO220: [BIO120],
HMB204: [BIO130],
HMB265: [BIO120],
BCH210: [CHM136],
PSL301: [BIO130]
BCH311: [BCH210],
HMB312: [HMB265, HMB204],
HAJ453: [BIO220, HMB312]
}
If we look at these courses as nodes, we can see immediately that
there are 3 sources in our graph, nodes without any incoming edges.
Their in-degrees
are zero, and so we know that we need to
start from there. In our context, BIO120, BIO130,
and CHM136
are courses that we can take
immediately, and not have to worry about the registrar kicking us out
halfway through the semester.
Now what? It seems like BCH210
requires CHM136
, so we would have to
take it one semester afterwards. Same goes for HMB265, BIO220, HMB204,
and PSL301
. Let’s skip ahead to Christmas
break.Our first courses are done. We don’t care about them anymore, and
they’re on our transcript.
= [ BIO120
transcript BIO130
, CHM136
,
]
We can now remove these completed courses from our graph, and all their outgoing edges pointing to higher-level courses. This also means removing the incoming edges as well. Here’s our new graph:
= {
courseRequirements HMB265: [],
BIO220: [],
HMB204: [],
PSL301: [],
BCH210: [],
BCH311: [BCH210],
HAJ453: [HMB312, BIO220]
HMB312: [HMB204]
}
Nice! We’ve finished all the requirements for 5 courses! It’s pretty clear what our next semester is gonna to look like. After our second semester, our transcript has grown:
= [ BIO120
transcript BIO130
, CHM136
, -- second semester
HMB265
, BIO220
, HMB204
, PSL301
, BCH210
, ]
Are we done yet? Nope. Looking back at
courseRequirements
, HAJ453
still requires courses that we haven’t gotten around to yet. Let’s
reorder our requirements list again.
= {
courseRequirements BCH311: [],
HMB312: []
HAJ453: [HMB312]
}
We take BCH311
and HMB312
next, before we can do HAJ453
in our senior year. Electives and
failed courses aside - in total, it took us 4 iterations to go from
being clueless freshmen to becoming well-respected biologists.
= [ BIO120
transcript BIO130
, CHM136
, -- second semester
HMB265
, BIO220
, HMB204
, PSL301
, BCH210
, -- third semester
BCH311
, HMB312
, -- fourth semester
HAJ453
, ]
Do you see the pattern in our strategy? We did a topological sort by
hand, using Kahn’s
algorithm. Starting off with a randomly permuted list of elements
(a, b)
, where b
has to be fulfilled before
a
, we identify which courses we’re eligible for (nodes with
no incoming edges). Next, we “visit” these nodes by taking the courses,
confident that we’re not lacking any background knowledge. After
removing our visited nodes from the graph, we delete their edges and
take note of the next batch of zero-degree nodes. Keep doing this until
we’re out of courses to take, and the University shoves a diploma in our
hands then cancels our health insurance.
This sucks to work out by hand, and we only had to deal with 11 courses. What about 40? (flashbacks to dealing with U of T’s degree explorer). What if we wanted to take every single course at the University? Where do we even start then? We need computers - let’s express this in their language.
def gradPlan(requirements):
= []
transcript
# each entry corresponds to a set of unfulfilled requirements
= defaultdict(set)
preReqs
for (prereq, course) in requirements:
preReqs[course].add(prereq)
# start our frosh week by selecting courses we can enroll in
=
eligibleCourses list(
filter(lambda course: not preReqs[course], preReqs)
)
while eligibleCourses:
= eligibleCourses.pop() # take any course that we're allowed to
current # add that to our transcript
transcript.append(current)
# now that we've taken the class, see which courses we've fulfilled a prereq for
= filter(lambda course: current in preReqs[course], preReqs)
nextCourses
for nxt in nextCourses:
# mark current as done
preReqs[nxt].discard(current) if not preReqs[nxt]:
eligibleCourses.append(nxt)# if we've satisfied the requirements, we're eligible to enroll now
return transcript # magna cum laude, baby
This isn’t syntactically valid Python
, and there are optimizations we can
make. But I think it illustrates the algorithm pretty well. At any point
in time, we can only take courses that we’re eligible to enroll in. The
relative order of courses doesn’t matter if they’re at the same “depth”
within the graph. For instance, we can take BIO120
at the same time as CHM136
since neither of them rely on any
prerequisites, so it doesn’t matter which of the two comes first in our
transcript. For courses that directly depend on these, however, we can
only take them after we’re finished. How are you supposed to know
understand glycolysis if you don’t even know what electrons are?
Now that we know the theory behind topological sorting, we can add it
to our toolkit. Luckily, you won’t have to implement tsort
by hand very often, because real engineers avoid code duplication at all
costs. Here’s an easy solution to the Leetcode problem:
from subprocess import check_output, DEVNULL
def courseSchedule(numCourses: int, prerequisites: List[List[int]]) -> List[int]:
= [" ".join(map(str, reversed(pair))) for pair in prerequisites]
rev_pair_strs = bytes("\n".join(rev_pair_strs), 'UTF-8')
formatted
= check_output(["tsort"], input=formatted, stderr=DEVNULL)
tsorted
return list(map(int, tsorted.splitlines()))
Beautiful.
It turns out that tsort
isn’t used in ld
anymore (grep
ped for it just now), but it’s a cool utility
that can come in handy. Don’t listen to O’Reilly
and their pessimism. Next time you run into a problem where you have to
consider the relationships between objects within a category, see if you
can reduce it to graph traversal. This is usually my go-to intuition
when I’m doing Leetcode problems, but the site is biased towards
graph-type questions anyway.