Most public language models are trained on vast amounts of data, much of it “scraped” from the public internet and from other content fed into the training process.

When you send something to OpenAI’s ChatGPT, for instance, OpenAI may use your data to further “train” its model (depending on your account settings). The intent is to build a broad knowledge base, which works well for most everyday interactions with ChatGPT.

For some uses, a more specific language model is needed, one trained on a narrower set of data. A law office, for example, may choose to train its own language model on the kind of documents it works with every day. This improves the quality of the responses, because it reduces the risk of the model pulling in unrelated information. Yes, even language models can get confused or miss the specific context of a question. Training your own language model can help, as the sketch below illustrates.
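
In practice, “training your own model” usually means fine-tuning an existing one on your own documents rather than starting from scratch. The sketch below shows roughly what that looks like with the Hugging Face Transformers library; the base model name, the contracts.txt file, and the hyperparameters are placeholder assumptions for illustration, not a recipe for a production legal model.

```python
# A minimal sketch of fine-tuning a small causal language model on
# domain-specific text. All names below (model, file, output folder)
# are hypothetical placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "distilgpt2"                      # assumed small base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token      # GPT-2-style models have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# "contracts.txt" stands in for the office's own text, one document per line.
dataset = load_dataset("text", data_files={"train": "contracts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="law-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
    # mlm=False selects the standard next-token (causal) training objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

After a run like this, the fine-tuned model is saved under the output folder and can be loaded for generation just like the base model; the narrower training data is what steers its answers toward the office’s own domain.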