Published: June 7, 2022

I am endlessly fascinated with content tagging systems. They're ubiquitous in software and have so many nuances, but I can't find anything on how to design and implement anything more than the barebones basics of a system. Some thoughts in a thread.

A tag is a metadata label associated with content, primarily used for querying and grouping. The tag name is also the id: two tags with the same name are the same tag. Tags appear everywhere: #hashtags, wikipedia categories, blog post labels, AWS infra tags...

Now, are `horse` and `horses` the same tag? They're different strings, but I'd be pretty miffed if I queried for `horse` and got only half the data. So for serious querying we need some kind of relationship between tags

The simplest relationship is tag aliases: if A is aliased to B, then querying A is identical to querying B. The only system I know that does that is the fanfiction site AO3, where teams of volunteers manually create aliases from, say, "snarry" to "Harry/Snape"
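
A minimal sketch of alias resolution, assuming aliases live in a plain lookup table. The names `ALIASES`, `INDEX`, `canonical`, and `query`, and the content ids, are hypothetical; this is not how AO3 actually implements it.

```python
# Hypothetical alias table: alias name -> the tag it is aliased to.
ALIASES = {"snarry": "Harry/Snape"}

# Hypothetical content index: canonical tag -> ids of tagged content.
INDEX = {"Harry/Snape": {"fic-1", "fic-2"}}


def canonical(tag):
    """Follow aliases until reaching a tag that is not itself an alias."""
    seen = set()
    while tag in ALIASES:
        if tag in seen:
            raise ValueError(f"alias cycle at {tag!r}")
        seen.add(tag)
        tag = ALIASES[tag]
    return tag


def query(tag):
    """Querying an alias is identical to querying its canonical tag."""
    return INDEX.get(canonical(tag), set())
```

Here `query("snarry")` and `query("Harry/Snape")` go through the same canonical name, so they return the same set.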

Any additional structure, though, raises questions about both design and use. Should users be able to intentionally query a specific tag alias? There's no single correct design choice here; it depends on your users. AO3 went with "no." A harder problem: tag hierarchies, or "subtags".

With subtags, we can query "science" and get everything tagged "physics", "quantum physics", and "quantum computing". Usage questions:
* Can things be tagged with root tags or just leaf tags?
* How do users search just X, not its subtags?
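
One way to sketch both query modes, assuming a single-parent tree. The exact shape of the hierarchy, the content ids, and the `include_subtags` flag are all assumptions made for illustration.

```python
# Hypothetical hierarchy: child tag -> parent tag (one parent each).
PARENT = {
    "physics": "science",
    "quantum physics": "physics",
    "quantum computing": "quantum physics",
}

# Hypothetical tagging: content id -> tags applied directly to it.
TAGGED = {
    "intro-to-science": {"science"},
    "newton": {"physics"},
    "qubits-101": {"quantum computing"},
}


def ancestors(tag):
    """Yield the tag itself, then each ancestor up to the root."""
    while tag is not None:
        yield tag
        tag = PARENT.get(tag)


def query(tag, include_subtags=True):
    """Return content tagged with `tag`, optionally via any subtag."""
    results = set()
    for item, tags in TAGGED.items():
        if include_subtags:
            match = any(tag in ancestors(t) for t in tags)
        else:
            match = tag in tags
        if match:
            results.add(item)
    return results
```

`query("science")` returns all three items, while `query("science", include_subtags=False)` returns only the one tagged directly with the root tag. This version walks the tree on every query, which is exactly the cost that becomes a problem at scale.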

There are also implementation considerations. First, transitive queries are expensive: how do you optimize them? Second, how do you prevent cycles in the tag hierarchy, where A and B are both transitive subtags of each other?
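
For the cycle question, a minimal sketch is to check at write time, assuming a single-parent hierarchy that is already acyclic. `would_create_cycle` and `set_parent` are hypothetical helper names. It doesn't address the query-cost question.

```python
def would_create_cycle(parent_of, child, new_parent):
    """True if making `new_parent` the parent of `child` would close a loop.

    Walks upward from `new_parent`; reaching `child` means `child` is
    already an ancestor of `new_parent`.
    """
    node = new_parent
    while node is not None:
        if node == child:
            return True
        node = parent_of.get(node)
    return False


def set_parent(parent_of, child, new_parent):
    """Record the parent link, refusing any edit that creates a cycle."""
    if would_create_cycle(parent_of, child, new_parent):
        raise ValueError(f"{child!r} is already an ancestor of {new_parent!r}")
    parent_of[child] = new_parent
```

Rejecting the bad edit up front keeps every later transitive query free to assume the hierarchy terminates.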

It gets even more complex if tags can have multiple parents, like Wikipedia categories. "American Male Novelists" is a subtag of "American Male Writers" and "American Novelists". Now we have diamond problems, redundancy, a whole host of other edge cases.
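
A small sketch of the diamond case, assuming the two parent categories share a grandparent. "American Writers" is added here purely for illustration, and `PARENTS` and `all_ancestors` are hypothetical names.

```python
# Hypothetical multi-parent hierarchy: tag -> list of parent tags.
PARENTS = {
    "American Male Novelists": ["American Male Writers", "American Novelists"],
    "American Male Writers": ["American Writers"],  # assumed for illustration
    "American Novelists": ["American Writers"],     # assumed for illustration
}


def all_ancestors(tag):
    """Collect every ancestor once, even when reachable by several paths."""
    seen = set()
    stack = [tag]
    while stack:
        current = stack.pop()
        for parent in PARENTS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

The `seen` set is what stops the shared grandparent from being visited twice; without it, results reached through a diamond get duplicated.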

Notice that the more richness in structure we add, the more ambiguous our queries become. "Every article that shares a tag with article X that's not also in article Y" is unambiguous with simple tags, less so with a tag tree, even less so with a tag directed acyclic graph
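
With simple tags, that query reduces to set operations. A sketch, assuming the tags are flat and excluding X and Y themselves from the result (that exclusion is my choice, and the data is made up):

```python
# Hypothetical flat tagging: article id -> set of tags.
TAGS = {
    "X": {"python", "testing"},
    "Y": {"testing"},
    "a1": {"python"},
    "a2": {"testing"},
    "a3": {"rust"},
}


def shares_tag_in_x_not_y(x, y):
    """Articles sharing at least one tag that x has and y does not."""
    wanted = TAGS[x] - TAGS[y]
    return {
        article
        for article, tags in TAGS.items()
        if article not in (x, y) and tags & wanted
    }
```

Once tags have parents, `TAGS[x] - TAGS[y]` stops being obvious: does a subtag in Y cancel its parent tag in X? That's where the ambiguity comes from.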

You also see "smart" tags, which are based on a predicate. One mobile device manager (MDM) I worked with allowed for things like "every iPad assigned to a classroom in Ferndale High School is considered to be tagged FHS". Advice: don't let the tag predicates refer to other tags

At one point we accidentally added two smart tags:
A: "Anything tagged B"
B: "Anything not tagged A"
And then the MDM crashed. We also saw major performance issues, where every content change forced all the smart tags to recompute, which was incredibly expensive.
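
If predicates must refer to other tags anyway, a looser alternative to banning it outright is to reject definitions that depend on themselves. This is a sketch under that assumption, not what the MDM did. `depends_on` is a hypothetical map from each smart tag to the tags its predicate mentions.

```python
def has_cycle(depends_on):
    """True if any smart tag transitively depends on itself."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, in progress, finished
    state = {}

    def visit(tag):
        state[tag] = GREY
        for dep in depends_on.get(tag, ()):
            dep_state = state.get(dep, WHITE)
            if dep_state == GREY:
                return True  # reached a tag that is still being explored
            if dep_state == WHITE and visit(dep):
                return True
        state[tag] = BLACK
        return False

    return any(state.get(tag, WHITE) == WHITE and visit(tag) for tag in depends_on)
```

The A/B pair above would be written as `{"A": {"B"}, "B": {"A"}}` and rejected before it is ever evaluated.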

One last type of tag structure: "key-value" tags, i.e. `priority: 1`. Then searching for `priority` will get you all content with a `priority` tag, while `priority: 1` gets you that specific subset. Lots of project management tooling has key-value tags, as does AWS
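
A minimal sketch of the two search modes, assuming each item carries a dict of key-value tags. The `ITEMS` data and the `search` function are hypothetical.

```python
# Hypothetical content: item id -> key-value tags.
ITEMS = {
    "ticket-1": {"priority": "1"},
    "ticket-2": {"priority": "2"},
    "ticket-3": {"owner": "sam"},
}


def search(key, value=None):
    """Match on the key alone, or on an exact key-value pair."""
    return {
        item
        for item, tags in ITEMS.items()
        if key in tags and (value is None or tags[key] == value)
    }
```

`search("priority")` returns both tickets that have the key; `search("priority", "1")` narrows that to the one with the matching value.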

Hypothetically you could have richer queries with key-value tags, like `start-date: BEFORE 2022-03-01`, but I haven't seen that available in any systems. You either search no values or a specific enumerated set.
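
A sketch of what that richer query could look like, assuming values are stored as typed data rather than strings. `search_where` and the project data are hypothetical, and this is not any real tool's API.

```python
from datetime import date

# Hypothetical content with typed tag values.
ITEMS = {
    "project-a": {"start-date": date(2022, 1, 15)},
    "project-b": {"start-date": date(2022, 4, 1)},
    "project-c": {"priority": 1},
}


def search_where(key, predicate):
    """Items that have `key` and whose value satisfies `predicate`."""
    return {
        item
        for item, tags in ITEMS.items()
        if key in tags and predicate(tags[key])
    }


# Equivalent of `start-date: BEFORE 2022-03-01`.
before_march = search_where("start-date", lambda d: d < date(2022, 3, 1))
```

The catch is that the comparison only makes sense if every `start-date` value has the same type, which free-form tags don't guarantee.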

Another interesting thing about tag systems: who creates the tags, and who are they for? In most systems all users can create arbitrary tags. For distributed platforms, like the semantic web and social media, this encourages spammers to add lots of unrelated tags

But even good actors go heavy on the tags, because nobody's curating any of the tags and you have no idea which ones people use. I call this the Instagram Problem #tag #tags #tagging #label #labels #tagsystems #taxonomies #folksonomy #hashtags #taggerlyfe #taggerlifestyle

I should note that software engineers aren't the first people to deal with these kinds of questions, which are ubiquitous in library and information sciences. I imagine they have a lot of really good theory and case studies on tagging systems, but I haven't been able to find it

@hillelogram Hi! We might be the solution to your problem.

@hillelogram Postgres "LTREE" data type and associated query operators are an absolute godsend for this. You can write a recursive CTE query that traverses an LTREE node hierarchy, and use a combination of custom alias rules and PG trigram/Levenshtein distance to build a simple but robust system

@hillelogram A little known fact is that Gutenberg was the first inventor of a tag system. It was so successful that it became an integral part of the German language. Today, you can hear Germans pay tribute to it whenever they meet one another. They simply give each other the: Guten Tag.

@hillelogram tags aren't implemented well anywhere because the core concept is weak I wrote^H^H^H^H^Hranted about this a while back http://forgedinstars.blogspot....

@hillelogram Human understanding is fuzzy and high dimensional, and is best modeled as a metric space (everything is a 'point' with a 'distance' to everything else). 'tagging' -- which is discrete -- is an optimization to get around metric spaces being too expensive to calculate

@hillelogram This whole thread is like a transcript of me after a few glasses of wine with the rest of the people around the campfire in mixtures of awkward and tolerant silence.

@hillelogram One way I think about it is, if you have key-value pair style tags, you basically are implementing entity-attribute-value. That kind of data can be in a triple store, and advanced queries could be semantic web queries, datalog, etc.

@hillelogram I suspect semantic hashing/search is probably the best way to solve the problems tagging tries to solve - doesn't require maintenance of tag ontology, lets you find documents that don't use the exact word but are v similar although the generative models need a lot of compute

@hillelogram Isn't this "just" the expression problem? In the sense that performing a query is execution of the search function with wildly different inputs. Replace "tag" with type, and it looks like all the problems of type hierarchies to me (including cycles)

@hillelogram You should look at WordPress' data structures. Ignore the taxonomy stuff (which is tags in the traditional sense), but the Post:PostMeta relationship is implemented as AV and there's a system for querying it. It's the basis of some power features + plugins.

@hillelogram Also, tags can be multilingual, and one can pick which languages should be returned when querying. Designing a tag system is a really exceptionally open space, and I've always felt it gives products amazing customer value back.

@hillelogram When urged to scratch our back we search for anything with the _attributes_ suited for the task. Classifications are contingent for us, thus any system forcing a classification, i.e. hiding attributes, quickly becomes a problem. Love this from Berkeley's Theory of Vision (hate OOP)


@hillelogram Related:

@hillelogram hi, please review @tana_inc and let us know

@hillelogram One approach I do see being actively researched treats tags as “syntax” to which we want to usefully assign meaning/“semantics” in a nicely behaved (functorial) way.

@hillelogram HotCRP has some additional wrinkles regarding tags, such as tags that only some people can see (due to either conflicts or permissions) and private tags that a user can set for themselves and other users can't see at all.

@hillelogram Look at Advanced Query Syntax: https://learn.microsoft.com/en...
