{"id":131,"date":"2023-07-08T07:36:00","date_gmt":"2023-07-08T07:36:00","guid":{"rendered":"https:\/\/themenectar.com\/salient\/mag\/?p=131"},"modified":"2026-03-27T09:06:10","modified_gmt":"2026-03-27T09:06:10","slug":"lessons-learned-from-professional-challenges","status":"publish","type":"post","link":"https:\/\/curriculo.me\/engineering\/lessons-learned-from-professional-challenges\/","title":{"rendered":"The Architecture of Information: How I Learned to Make Messy Data Readable"},"content":{"rendered":"\n<p>Not long ago, my system hit a breaking point.<\/p>\n\n\n\n<p>It wasn\u2019t a crash; it was a <strong>trust problem<\/strong>. I watched a client dataset get ingested, and instead of a clean structure, the system created a mess of duplicates. The same organization appeared three times in the knowledge graph because the source data used three different naming conventions: <em>\u201cMcKinsey &amp; Company,\u201d<\/em> <em>\u201cMcKinsey and Company,\u201d<\/em> and <em>\u201cMckinsey.\u201d<\/em> To a human, it\u2019s obviously one entity. To a graph, it was three separate nodes with three disconnected sets of relationships. Three parallel versions of the same truth.<\/p>\n\n\n\n<p>That was the moment I stopped thinking about data as something you simply store and started thinking about it as something you <strong>architect<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Everything Is Messier Than the Schema Suggests<\/strong><\/h3>\n\n\n\n<p>Here\u2019s what nobody prepares you for when you build systems that process human-generated data: humans are wildly inconsistent. And I don\u2019t mean that in a charming way; I mean it in a way that breaks every assumption your schema makes.<\/p>\n\n\n\n<p>One person writes &#8220;Python.&#8221; Another writes &#8220;Python Programming Language.&#8221; A third writes &#8220;python&#8221; in lowercase. To a naive string comparison, these are three completely different entities. Three nodes. 
Three lies in your graph.<\/p>\n\n\n\n<p>I didn\u2019t set out to solve this problem; I just wanted to build a clean data pipeline. But when your pipeline ingests human data, this mess <em>is<\/em> the problem. Everything else is downstream.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Visual Hierarchy as an Engineering Protocol<\/strong><\/h3>\n\n\n\n<p>Before I go deep on the technical architecture, let me make a case for something engineers don\u2019t talk about enough: <strong>how information looks matters as much as what it contains.<\/strong><\/p>\n\n\n\n<p>When you glance at a well-structured report or dashboard, your brain builds a mental model in under three seconds. You instinctively know where the key data lives and the relative importance of each section. This isn&#8217;t &#8220;design&#8221;\u2014it&#8217;s a <strong>data communication protocol.<\/strong><\/p>\n\n\n\n<p>The way you present data shapes the decisions people make from it. When I\u2019m building a system to structure information, I can\u2019t just think about schema correctness. I have to think about <strong>perceptual weight.<\/strong> What does a human\u2019s eye land on first? How do I encode those priorities into a data structure that a rendering engine can actually act on?<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Knowledge Graphs: Relationships Are the Data<\/strong><\/h3>\n\n\n\n<p>A database table stores facts. 
A knowledge graph stores <strong>meaning<\/strong>.<\/p>\n\n\n\n<p>In my graph, every piece of information is a node, but the actual intelligence lives in the <strong>edges<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>HAS_ATTRIBUTE: Links an entity to its specific, contextual properties.<\/li>\n\n\n\n<li>BELONGS_TO: Associates a record with its parent organization, which has its own properties and relationships.<\/li>\n\n\n\n<li>REQUIRES: Connects a project or record to the specific technologies or skills involved.<\/li>\n<\/ul>\n\n\n\n<p>When data is linked this way, it isn&#8217;t just a list; it\u2019s a <strong>narrative<\/strong>. That narrative emerges naturally from the structure; I don\u2019t have to write special &#8220;narrative generation&#8221; code. The relationships <em>are<\/em> the story.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Data Density: The Constraint is the Feature<\/strong><\/h3>\n\n\n\n<p>I\u2019m obsessed with <strong>Data Density<\/strong>\u2014meaningful information per unit of visual space. A dense document isn\u2019t a cluttered one; it\u2019s one where every element earns its position. In my system, density is enforced through strict structural constraints:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>items_per_category: 5: You don&#8217;t list 47 attributes. You list the five that matter. The constraint forces prioritization.<\/li>\n\n\n\n<li>summary_max_sentences: 2: Two sentences. That\u2019s it. Say what it is and why it matters. 
Everything else is noise.<\/li>\n\n\n\n<li>forbidden_filler_phrases: I wrote actual validation logic that rejects empty words like <em>&#8220;best-in-class&#8221;<\/em> or <em>&#8220;synergy.&#8221;<\/em> They take up space and communicate nothing.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>The Standardization Stack: The Plumbing Nobody Sees<\/strong><\/h3>\n\n\n\n<p>To solve that &#8220;McKinsey&#8221; problem, I built a four-layer stack to ensure data integrity:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Normalization:<\/strong> Cleaning names on ingestion and stripping verbose suffixes.<\/li>\n\n\n\n<li><strong>Fuzzy Deduplication:<\/strong> Using a 0.82 similarity threshold: low enough to catch &#8220;Docker Container&#8221; and &#8220;Docker,&#8221; but high enough to keep &#8220;Go&#8221; and &#8220;Git&#8221; separate.<\/li>\n\n\n\n<li><strong>Deterministic IDs:<\/strong> Every node gets an ID derived from its content (e.g., attr_python). Process the same messy input twice, get the same clean graph once.<\/li>\n\n\n\n<li><strong>The Mutation Engine:<\/strong> A dispatch system that handles updates and corrections without losing the provenance of the data.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>What I Learned from Failure<\/strong><\/h3>\n\n\n\n<p>This system has failed in ways that taught me more than the successes. It once merged &#8220;Machine Learning&#8221; and &#8220;Machine Operation&#8221; because they shared a high similarity score. 
That failure taught me the massive gap between &#8220;structurally valid&#8221; and &#8220;semantically meaningful.&#8221;<\/p>\n\n\n\n<p>I also learned to treat automated extraction as <strong>untrusted input.<\/strong> I now use noise-pattern filters to ensure the system doesn&#8217;t &#8220;hallucinate&#8221; records that sound plausible but don&#8217;t exist.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>The Bottom Line<\/strong><\/h3>\n\n\n\n<p>The architecture of information is the <strong>architecture of trust.<\/strong> When data is normalized and semantically linked, people trust the output. When it\u2019s full of orphans and duplicates, the substance doesn&#8217;t matter.<\/p>\n\n\n\n<p>The job isn&#8217;t just to write code; it&#8217;s to turn chaos into something queryable. If you want to turn noise into knowledge, you don&#8217;t just need storage. You need architecture.<\/p>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex\"><\/figure>\n\n\n\n<p><script type=\"application\/ld+json\">{\"@context\": \"https:\/\/schema.org\", \"@type\": \"Article\", \"headline\": \"Lessons Learned from Professional Challenges\", \"url\": \"https:\/\/curriculo.me\/engineering\/lessons-learned-from-professional-challenges\/\", \"isAccessibleForFree\": true, \"author\": {\"@type\": \"Person\", \"name\": \"Dev\", \"url\": \"https:\/\/curriculo.me\/about-us\/\"}, \"publisher\": {\"@type\": \"Organization\", \"@id\": \"https:\/\/curriculo.me\/#organization\", \"name\": \"Curriculo\", \"url\": \"https:\/\/curriculo.me\/\", \"logo\": {\"@type\": \"ImageObject\", \"url\": \"https:\/\/curriculo.me\/wp-content\/uploads\/2026\/03\/cropped-Curriculo.png\"}}, \"datePublished\": \"2023-07-08T07:36:00\", \"dateModified\": \"2023-07-08T07:36:00\", \"mainEntityOfPage\": {\"@type\": \"WebPage\", \"@id\": 
\"https:\/\/curriculo.me\/engineering\/lessons-learned-from-professional-challenges\/\"}, \"inLanguage\": \"en-US\"}<\/script><script type=\"application\/ld+json\">{\"@context\": \"https:\/\/schema.org\", \"@type\": \"BreadcrumbList\", \"itemListElement\": [{\"@type\": \"ListItem\", \"position\": 1, \"name\": \"Home\", \"item\": \"https:\/\/curriculo.me\/\"}, {\"@type\": \"ListItem\", \"position\": 2, \"name\": \"Engineering\", \"item\": \"https:\/\/curriculo.me\/engineering\/\"}, {\"@type\": \"ListItem\", \"position\": 3, \"name\": \"Lessons Learned from Professional Challenges\"}]}<\/script><\/p>\n","protected":false},"excerpt":{"rendered":"<p>What happens when a Data Scientist decides that &#8220;unstructured&#8221; is a personal insult<\/p>\n","protected":false},"author":2,"featured_media":867,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6,12,9],"tags":[18,19],"class_list":{"0":"post-131","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-growth","8":"category-software","9":"category-tech","10":"tag-data-science","11":"tag-data-structure"},"_links":{"self":[{"href":"https:\/\/curriculo.me\/engineering\/wp-json\/wp\/v2\/posts\/131","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/curriculo.me\/engineering\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/curriculo.me\/engineering\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/curriculo.me\/engineering\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/curriculo.me\/engineering\/wp-json\/wp\/v2\/comments?post=131"}],"version-history":[{"count":5,"href":"https:\/\/curriculo.me\/engineering\/wp-json\/wp\/v2\/posts\/131\/revisions"}],"predecessor-version":[{"id":870,"href":"https:\/\/curriculo.me\/engineering\/wp-json\/wp\/v2\/posts\/131\/revisions\/870"}],"wp:featuredmedia":[{"embeddable":tru
e,"href":"https:\/\/curriculo.me\/engineering\/wp-json\/wp\/v2\/media\/867"}],"wp:attachment":[{"href":"https:\/\/curriculo.me\/engineering\/wp-json\/wp\/v2\/media?parent=131"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/curriculo.me\/engineering\/wp-json\/wp\/v2\/categories?post=131"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/curriculo.me\/engineering\/wp-json\/wp\/v2\/tags?post=131"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}