Teaching AI Your Domain — Where Does Your Corporate Data Actually Go?
The argument flows naturally from the premise.
"If AI understands the domain, the people who currently hold that domain knowledge become redundant."
For organizations, this logic is attractive. Instead of maintaining high-salary domain experts, you use AI. Legal teams. Medical consultants. Financial analysts. Senior engineers with two decades of institutional knowledge. If AI has absorbed what they know, their positions become harder to justify.
But there's a hidden premise in this logic.
For AI to understand your domain, someone has to feed it that domain.
What happens during that feeding process is something almost nobody discusses carefully.
AI Doesn't Learn on Its Own
This is widely misunderstood.
When we say ChatGPT or Claude "understands law" or "knows medicine," we're describing the result of pre-training on massive amounts of publicly available text. Published legal documents, medical research papers, financial disclosures — these went into the training data. That's why these models have general domain knowledge at a surface level.
But internal organizational knowledge is different.
A financial institution's forced liquidation trigger thresholds are not published anywhere. A specific hospital's patient intake workflow isn't in a GitHub repository. A manufacturer's defect classification criteria haven't appeared in research papers. AI's pre-training reaches as far as public information. The knowledge any organization has accumulated over decades — its internal policies, operational rules, edge-case handling — sits entirely outside AI's baseline capabilities.
So when someone says "we'll teach AI our domain," they're necessarily describing this process: feeding internal documents, policies, processes, and rules into an AI system. Building a RAG pipeline that connects internal data as a retrieval source. Fine-tuning a model with organization-specific knowledge. Or attaching relevant internal documents to every prompt as context.
Every one of these paths routes internal information through an external system.
That's where the risk begins.
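Reduced to its essentials, the RAG path described above can be sketched as follows. This is a minimal illustration, not a production pipeline: real systems use embedding models rather than keyword overlap, and `build_prompt` here stands in for whatever assembles the request sent to an external API. The document contents and function names are hypothetical.

```python
def score(query: str, doc: str) -> int:
    """Crude keyword-overlap relevance score (real pipelines use embeddings)."""
    q_terms = set(query.lower().split())
    return sum(1 for term in set(doc.lower().split()) if term in q_terms)

def build_prompt(query: str, internal_docs: list[str], top_k: int = 2) -> str:
    """Retrieve the most relevant internal documents and inline them as context."""
    ranked = sorted(internal_docs, key=lambda d: score(query, d), reverse=True)
    context = "\n---\n".join(ranked[:top_k])
    # The retrieved internal text is embedded verbatim in the prompt, so
    # whatever it contains is transmitted to the external service.
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "Forced liquidation triggers at 130% margin ratio per internal policy 4.2.",
    "Cafeteria menu rotates weekly.",
]
prompt = build_prompt("What is our forced liquidation threshold?", docs)
print("130%" in prompt)  # the confidential policy text is now in the outbound request
```

The point of the sketch is the last comment: retrieval quality and leakage surface grow together, because making the answer better means putting more internal text into the outbound prompt.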
Samsung Found Out First
In April 2023, Samsung Electronics disclosed that confidential internal information had leaked through ChatGPT. Three separate incidents within roughly twenty days: two involved employees pasting semiconductor equipment-related source code into ChatGPT to request fixes and improvements; a third submitted a meeting recording's transcript for summarization. The details that came out were limited, but the fact of the leaks was not disputed.
Samsung responded by restricting internal AI usage and pivoting toward building internal AI tools. And Samsung wasn't alone. The same period saw similar incidents reported across multiple organizations. Goldman Sachs, JPMorgan, and Apple all issued formal internal bans on ChatGPT usage around the same time.
These incidents reveal something more fundamental than individual employee mistakes.
The structure itself is fragile. Asking employees to judge, in every interaction, "how much of this is safe to share?" creates a decision burden that compounds over time. And AI's utility creates a pressure in exactly the wrong direction: the more specific the context you provide, the better the output. Better code review requires more code. Better meeting summaries require more detail. The incentive to share more is embedded in the tool's design.
Security policies struggle to contain this. Blocking copy-paste into a browser window is not realistically enforceable at a technical level. Data Loss Prevention systems can catch some structured patterns, but free-text entry into a chat interface remains difficult to monitor comprehensively at the content level.
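The gap DLP leaves can be shown in a few lines. The patterns below are illustrative stand-ins, not a real DLP ruleset: well-formed identifiers match, but a free-text paraphrase of the same confidential fact sails through.

```python
import re

# Toy pattern rules of the kind a DLP system might apply (illustrative only).
PATTERNS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),   # AWS-style access key ID
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-style number
]

def dlp_flags(text: str) -> bool:
    """True if any pattern rule matches the outbound text."""
    return any(p.search(text) for p in PATTERNS)

print(dlp_flags("key is AKIAABCDEFGHIJKLMNOP"))  # True: structured secret caught
# Confidential in substance, but no pattern fires:
print(dlp_flags("our margin-call threshold is 130 percent, per policy 4.2"))  # False
```

Structured secrets are the easy case. The sensitive content employees actually type into a chat box is mostly unstructured prose, which is exactly what pattern matching cannot classify.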
What Does the Cloud Do With Your Data?
When you submit a prompt, where does that data go?
The official position from major providers is consistent: enterprise API contracts mean your inputs aren't used for model training. OpenAI, Anthropic, Google — all have policies along these lines. Input data under business API agreements is not used to improve the model.
The problem is verification.
There's no external way to confirm this. The arrangement requires trusting the company's statement. Not that these companies are lying — but depending on an unverifiable promise to protect competitive advantage is a structurally weak position regardless of the trustworthiness of the party making the promise. Even with contractual guarantees, data passing through external servers gets logged somewhere, cached during inference, recorded in monitoring systems — and none of that is fully within the control of the organization sending the data.
Then there's breach risk. Large AI services are targets. In March 2023, OpenAI disclosed an incident in which a subset of ChatGPT Plus users had payment information and portions of conversation history exposed. The scope was limited, but the underlying fact doesn't change: data that lives on cloud infrastructure can be exposed when that infrastructure is compromised.
The decision to have AI understand your domain is simultaneously a decision to put your domain information on external servers. That risk needs to be made explicit and deliberately accepted — not quietly absorbed in the name of productivity improvement.
What About Running It Locally?
The obvious counterargument:
"If cloud is the risk, run a local model. On-premises deployment means data never leaves the building."
This is correct, and organizations with serious security requirements — financial institutions, healthcare systems, defense contractors, government agencies — are taking this direction seriously. Some have already deployed internal models.
But reality introduces friction here too.
Running a model at cloud-frontier quality on-premises requires substantial hardware. Getting close to GPT-4-level performance from an open-source model requires high-end GPU servers — infrastructure investment in the hundreds of thousands to millions of dollars range depending on scale. Mid-sized organizations face this as a significant capital decision. Even large enterprises need clear utilization rates and ROI cases to justify it.
Performance remains a gap. The models that fit into on-premises deployments today trail cloud frontier models — particularly on complex domain reasoning and long-context coherence. Preserving security by running locally means accepting meaningful performance compromises.
And maintenance creates its own overhead. Cloud services handle model updates, infrastructure management, and security patching on the provider's side. On-premises deployment moves all of that burden internal. You need ML engineers. GPU infrastructure management. A cadence for model updates and capability evaluation. The goal was replacing domain experts with AI; the result is needing new experts to maintain the AI infrastructure.
The Contradiction in "We Don't Need That Person Anymore"
Return to the original premise.
"Once AI understands the domain, the domain expert is no longer necessary."
For this to hold, three things must be true simultaneously: AI must actually understand the organization's specific domain. The process of teaching AI that domain must carry no security risk. Maintaining AI's domain understanding must require no additional specialized personnel.
None of these three conditions currently hold.
AI knows general domains but not organization-specific ones — established earlier. The security risks of the teaching process are real — just covered. The third: maintaining a domain-aware AI system is not free.
Say you've built a RAG pipeline. Works well at first. Then internal policy changes. A new regulation takes effect. A legacy system gets replaced. Each change requires updating the pipeline, re-indexing the vector database, adjusting prompts, validating output quality. Who does this work? Someone who understands the domain. The process of maintaining AI's domain awareness doesn't eliminate the domain expert — it transforms them into a translator between the domain and the AI system.
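One small piece of that maintenance burden, detecting which indexed documents have drifted from their live versions, might look like the sketch below. The document IDs and contents are hypothetical; the point is that even this mechanical step only tells you *what* changed, while deciding whether the change alters answers still takes someone who knows the domain.

```python
import hashlib

def digest(text: str) -> str:
    """Content hash used to detect document drift."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_docs(indexed: dict[str, str], current: dict[str, str]) -> set[str]:
    """Doc IDs whose live content no longer matches what was indexed.

    `indexed` maps doc ID -> content hash at index time; `current` maps
    doc ID -> live content. New or changed docs need re-embedding.
    """
    return {
        doc_id for doc_id, text in current.items()
        if indexed.get(doc_id) != digest(text)
    }

indexed = {"policy-4.2": digest("Trigger at 120% margin ratio.")}
current = {
    "policy-4.2": "Trigger at 130% margin ratio.",  # policy was amended
    "policy-9.1": "New regulation effective Q3.",   # never indexed
}
print(sorted(stale_docs(indexed, current)))  # ['policy-4.2', 'policy-9.1']
```

Re-indexing is the automatable part. Validating that the updated answers are still correct is not.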
So "we don't need that person anymore" sounds right but isn't. The shape of the needed role changes; the need itself doesn't disappear. The person who used to analyze and decide directly now designs and manages the system that does the analyzing and deciding. The work changes character; it doesn't vanish.
Domain Experts Become Gatekeepers
In organizations where AI adoption has matured, a consistent pattern appears.
AI doesn't eliminate specific job categories — it splits them. Among people doing a given type of work, those who can operate effectively with AI separate from those who cannot. When legal teams adopt AI contract review tools, people without legal knowledge don't start reviewing contracts; lawyers review AI output and make judgment calls. When hospitals deploy diagnostic AI, physicians don't disappear; they use AI analysis as reference and retain final decision authority.
This pattern repeats for a structural reason: evaluating AI output quality requires the domain knowledge needed to produce that output.
If AI reviews a contract and flags a clause as risky, determining whether that flag is accurate requires contract law knowledge. If AI analyzes patient symptoms and suggests a diagnosis, assessing the plausibility of that suggestion requires medical expertise. If AI reviews code and identifies a potential performance issue, knowing whether that issue is meaningful in this system's actual usage patterns requires domain knowledge about the service.
Validating AI output becomes the new bottleneck. Domain experts are the only people who can clear that bottleneck.
Paradoxically, AI adoption often strengthens domain experts' organizational position. AI generates more drafts, creating more things to validate. Contracts that weren't reviewed because legal review was expensive now get reviewed because AI drafting is cheap — creating more demand for legal experts to check the AI's work. Volume for domain experts increases.
What Organizations Actually Need to Figure Out
Companies seriously evaluating AI adoption need to be working through questions like these:
When employees enter our internal information into external AI services, where is the line between acceptable and prohibited? How do we give employees clarity on that line in everyday practice? What technical controls can enforce it? Is there a category of information that requires on-premises deployment, and can we justify the cost and performance tradeoff? If we build AI systems that reflect our domain knowledge, who maintains them and in what role?
None of these questions get answered by AI. Organizations have to decide. And making these decisions well requires people who understand both the technology and the domain to be part of the process.
"AI understands the domain, so we don't need the people" skips this entire process. Teaching AI to understand the domain. Managing the risks of that teaching. Maintaining the understanding over time. Validating the outputs. Every stage requires people, and those people need to know the domain.
AI Learns From Corporate Secrets
There's a final structural observation worth sitting with.
What's most valuable to AI service companies? As more organizations feed more internal data into AI systems — even if that data isn't used directly for model training — some form of value flows upward. Which types of questions arrive most frequently. What context makes outputs useful. Which domains are growing in usage. This meta-level information alone allows service providers to refine models and steer product development with remarkable precision.
"We don't use it for training" may be entirely sincere. The question of who benefits from the systematic flow of organizational domain knowledge into external platforms is still worth asking.
When a company decides to teach AI its domain, it's placing its competitive advantage on an external platform. That platform can change pricing, shift policies, or shut down. Domain knowledge that lives inside employees stays with the company as long as those employees do. Domain knowledge encoded into AI pipeline configurations is tied to the continuity of the external service those pipelines depend on.
This is a risk for every organization that undertakes it. Knowing that risk and accepting it deliberately is a different thing from absorbing it invisibly in the name of productivity.
"Teaching AI our domain" involves more than the phrase suggests.
Decisions about what internal information goes where. Structures to control the scope of that exposure. The probability of leakage incidents. The cost and performance limits of on-premises alternatives. The transformation of domain experts' roles. The risks of deep dependency on external platforms.
All of this is packed inside the phrase "AI adoption."
Organizations that treat these complications as something AI will figure out are going to encounter them the hard way. Usually after something breaks.