Large language model
A large language model (LLM) is a type of artificial neural network trained on vast amounts of text data to understand, generate, and manipulate human language. LLMs are a core technology behind modern generative artificial intelligence systems such as ChatGPT, Claude, Grok, and Gemini.
History
The foundations of large language models trace back to early statistical language models and recurrent neural networks (RNNs). Key milestones include:
- 2017: The seminal paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) by Ashish Vaswani and colleagues at Google introduced the transformer architecture, which replaced recurrent layers with self-attention mechanisms, enabling much better parallelization and scaling.
- 2018: OpenAI released GPT-1, followed by GPT-2 in 2019, demonstrating the power of scaling up transformer-based models.
- 2020: GPT-3 with 175 billion parameters showed emergent abilities such as few-shot learning, sparking widespread public interest.
- 2022–2023: The release of ChatGPT (based on GPT-3.5 and later GPT-4) brought LLMs into mainstream use. Open-weight models such as Meta's Llama series and Mistral AI's releases democratized access.
- 2024–2026: Continued scaling with multimodal models (text + image + audio), longer context windows (millions of tokens), and reasoning-focused architectures.
Architecture
Most modern LLMs are based on the decoder-only transformer architecture, built from the following components (a minimal code sketch of one such block follows the list):
- Self-attention mechanism that allows the model to weigh the importance of different words in a sequence.
- Feed-forward neural networks applied at each position.
- Layer normalization and residual connections for stable training.
- Positional encoding (or rotary embeddings like RoPE) to handle sequence order.
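The following is a minimal sketch, in PyTorch, of a single pre-norm decoder block combining these components: causal self-attention, a feed-forward sub-layer, layer normalization, and residual connections. The dimensions and class names are illustrative assumptions and do not correspond to any particular published model.

```python
# Minimal sketch of one pre-norm decoder-only transformer block (PyTorch).
# Sizes and names are illustrative, not taken from any specific model.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Causal mask: each position may attend only to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                  # residual connection around attention
        x = x + self.ff(self.norm2(x))    # residual connection around feed-forward
        return x

x = torch.randn(2, 16, 512)               # (batch, sequence, embedding)
print(DecoderBlock()(x).shape)             # torch.Size([2, 16, 512])
```

In a full model, dozens of such blocks are stacked, with token and positional embeddings at the input and a projection to vocabulary logits at the output.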
Key variants include:
- Encoder-decoder (e.g., original T5, BART)
- Decoder-only (most popular for generative tasks: GPT, Llama, Grok, Mistral)
- Mixture-of-Experts (MoE) architectures (e.g., Mixtral, Grok-1) that activate only a subset of parameters per token for efficiency (see the routing sketch below).
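A minimal sketch of token-level top-k expert routing, the core idea behind MoE layers, is shown below. The gating scheme, expert sizes, and names are illustrative assumptions rather than the routing used by any specific model.

```python
# Sketch of token-level top-k expert routing: each token is sent to only a
# few expert feed-forward networks, so most parameters stay inactive per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)          # router
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = self.gate(x)                                # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)       # pick k experts per token
        weights = F.softmax(weights, dim=-1)                 # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                sel = idx[:, k] == e                         # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[sel, k:k+1] * self.experts[e](x[sel])
        return out

tokens = torch.randn(10, 512)
print(MoELayer()(tokens).shape)                              # torch.Size([10, 512])
```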
Training
LLMs undergo two main training phases:
Pre-training
- Objective: Next-token prediction (causal language modeling) or, for encoder-style models, masked language modeling (a minimal loss sketch follows this list).
- Data: Trillions of tokens from web crawls (Common Crawl), books, Wikipedia, code repositories, scientific papers, and more.
- Compute: Trained on thousands of GPUs/TPUs for weeks or months using massive distributed training frameworks.
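A minimal sketch of the causal language modeling objective is shown below: the targets are the input tokens shifted left by one position, and the loss is the average cross-entropy of the model's next-token predictions. The random logits stand in for a real model's output, and the batch and vocabulary sizes are placeholders.

```python
# Sketch of the next-token prediction (causal LM) objective: the model is
# trained to predict token t+1 from tokens 0..t.
import torch
import torch.nn.functional as F

vocab_size = 32_000
token_ids = torch.randint(0, vocab_size, (4, 128))       # (batch, sequence)

inputs = token_ids[:, :-1]                                # positions 0..n-2 fed to the model
targets = token_ids[:, 1:]                                # positions 1..n-1 to be predicted

logits = torch.randn(4, 127, vocab_size)                  # stand-in for model(inputs)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss)                                               # average negative log-likelihood
```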
Post-training (alignment)
- Supervised fine-tuning (SFT) on high-quality instruction datasets.
- Reinforcement Learning from Human Feedback (RLHF) or alternatives such as Direct Preference Optimization (DPO) to make outputs more helpful, honest, and harmless (a minimal DPO loss sketch follows).
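The sketch below illustrates the DPO loss for a single preference pair (a "chosen" and a "rejected" response). The log-probability values are placeholders for sequence log-likelihoods under the trained policy and a frozen reference model, and the beta value is an illustrative hyperparameter.

```python
# Sketch of the Direct Preference Optimization (DPO) loss for one preference pair.
import torch
import torch.nn.functional as F

beta = 0.1                                    # illustrative DPO temperature
policy_chosen_logp   = torch.tensor(-12.0)    # log p_policy(chosen | prompt)
policy_rejected_logp = torch.tensor(-15.0)    # log p_policy(rejected | prompt)
ref_chosen_logp      = torch.tensor(-13.0)    # log p_ref(chosen | prompt)
ref_rejected_logp    = torch.tensor(-14.0)    # log p_ref(rejected | prompt)

# Implicit rewards: how much more likely the policy makes each response than the reference.
chosen_reward   = beta * (policy_chosen_logp - ref_chosen_logp)
rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

loss = -F.logsigmoid(chosen_reward - rejected_reward)
print(loss)   # shrinks as the policy prefers the chosen response more strongly than the reference does
```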
Capabilities
Large language models can perform a wide range of tasks:
- Text generation, summarization, translation, and rewriting
- Question answering and knowledge retrieval
- Code generation and debugging
- Mathematical reasoning (improved in recent models)
- Creative writing, role-playing, and conversation
- Multimodal understanding (in models like GPT-4o, Gemini, Claude 3)
Emergent abilities appear as models scale: capabilities that were not explicitly trained for but arise once models exceed certain size thresholds.
Limitations and Challenges
- Hallucinations: Generating plausible but factually incorrect information.
- Context window limits (though rapidly expanding to 1M+ tokens).
- Bias and toxicity inherited from training data.
- High computational cost for training and inference.
- Lack of true understanding: models predict patterns rather than comprehend meaning.
- Reasoning limitations: models struggle with complex multi-step problems without techniques such as chain-of-thought prompting (see the prompt example below).
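The example below illustrates what chain-of-thought prompting looks like in practice: the same question posed directly and with an instruction to reason step by step before answering. The question and wording are illustrative; no particular model or API is assumed.

```python
# Illustrative chain-of-thought prompting: appending an instruction to reason
# step by step often improves answers to multi-step arithmetic or logic questions.
question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

direct_prompt = f"{question}\nAnswer:"
cot_prompt = f"{question}\nLet's think step by step, then give the final answer."

print(direct_prompt)
print(cot_prompt)
```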
Notable Models
| Model | Developer | Parameters | Release | Notes |
|---|---|---|---|---|
| GPT-4 | OpenAI | Undisclosed (~1.7T rumored) | 2023 | Multimodal, strong reasoning |
| Claude 3.5 Sonnet | Anthropic | Undisclosed | 2024–2025 | Known for safety and coding |
| Llama 3 / Llama 4 | Meta | 8B–405B+ | 2024–2025 | Open weights |
| Grok | xAI | Various | 2023–2026 | Integrated with the X platform; marketed as emphasizing real-time information and humor |
| Gemini | Google | Various | 2023–2025 | Deep integration with Google ecosystem |
| Mistral Large / Mixtral | Mistral AI | Various | 2023–2025 | Efficient open models |
Societal Impact
LLMs have transformed industries including:
- Software development (GitHub Copilot, Cursor)
- Education and research assistance
- Content creation and customer service
- Scientific discovery (e.g., AlphaFold integration, materials science)
Concerns include:
- Job displacement in writing, coding, and analysis roles
- Misinformation and deepfakes
- Intellectual property and copyright issues
- Existential risk debates regarding artificial general intelligence
Ethical and Safety Considerations
Major labs implement various safety measures:
- Constitutional AI (Anthropic)
- System prompts and guardrails
- Red teaming for adversarial testing
- Watermarking and detection tools for AI-generated content
See also
- Transformer (machine learning model)
- Generative pre-trained transformer
- Artificial general intelligence
- Prompt engineering
- AI alignment
External links
[edit]- "Attention Is All You Need" — foundational transformer paper
- GPT-4 Technical Report
- Various model cards on Hugging Face