SQL (Structured Query Language) is a standard language for managing and manipulating relational databases. It's divided into five types: DDL (Data Definition Language) for schema definition like CREATE, ALTER, DROP; DML (Data Manipulation Language) for data manipulation like SELECT, INSERT, UPDATE, DELETE; DCL (Data Control Language) for permissions like GRANT, REVOKE; TCL (Transaction Control Language) like COMMIT, ROLLBACK; and DQL (Data Query Language) primarily SELECT statements. SQL is declarative, meaning you specify what data you want, not how to get it.
2. What is the difference between WHERE and HAVING clause?
WHERE filters rows before grouping and cannot use aggregate functions, operating on individual rows. HAVING filters groups after aggregation and works with aggregate functions like COUNT, SUM, AVG. WHERE is used with SELECT, UPDATE, DELETE statements, while HAVING is used specifically with GROUP BY. For example, WHERE filters employees before counting, while HAVING filters departments after counting employees. WHERE executes first in query processing order, followed by GROUP BY, then HAVING.
3. Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.
INNER JOIN returns only matching rows from both tables based on the join condition. LEFT JOIN (LEFT OUTER JOIN) returns all rows from the left table and matched rows from the right, with NULLs for unmatched right rows. RIGHT JOIN returns all rows from the right table and matched rows from the left, with NULLs for unmatched left rows. FULL JOIN (FULL OUTER JOIN) returns all rows from both tables, with NULLs where there's no match. Choose based on whether you need all records from one or both tables.
4. What is a Primary Key and Foreign Key?
A Primary Key uniquely identifies each row in a table, cannot contain NULL values, and each table can have only one primary key (single or composite columns). A Foreign Key establishes relationships between tables by referencing the primary key of another table, can contain NULL values, and enforces referential integrity. Foreign keys ensure data consistency by preventing orphaned records. A table can have multiple foreign keys, and they enable cascading updates and deletes to maintain data integrity across related tables.
5. What is the difference between DELETE, TRUNCATE, and DROP?
DELETE is a DML command that removes specific rows based on WHERE conditions, can be rolled back, triggers fire, and is slower. TRUNCATE is a DDL command that removes all rows quickly, cannot be rolled back (in most databases), doesn't fire triggers, and resets identity columns. DROP is a DDL command that removes the entire table structure and data permanently, cannot be rolled back, and removes table definition from the database. Use DELETE for selective removal, TRUNCATE for clearing tables, DROP for removing tables entirely.
6. What are constraints in SQL?
Constraints are rules enforced on table columns to ensure data accuracy and integrity. PRIMARY KEY uniquely identifies rows and prevents NULLs. FOREIGN KEY maintains referential integrity between tables. UNIQUE ensures all values in a column are distinct (allows one NULL). NOT NULL prevents NULL values. CHECK validates data against specific conditions. DEFAULT assigns default values when none provided. Constraints can be defined at column level or table level, and they're enforced automatically by the database engine.
7. What is normalization and its types?
Normalization organizes database tables to reduce redundancy and dependency, improving data integrity. First Normal Form (1NF) eliminates repeating groups and ensures atomic values. Second Normal Form (2NF) achieves 1NF and removes partial dependencies. Third Normal Form (3NF) achieves 2NF and removes transitive dependencies. BCNF (Boyce-Codd Normal Form) is stricter 3NF. Fourth Normal Form (4NF) handles multi-valued dependencies. Higher normal forms exist but are rarely used. Normalization trades storage efficiency and data integrity against query complexity and potential performance impact.
8. What is denormalization and when do you use it?
Denormalization intentionally introduces redundancy by combining tables or adding duplicate data to improve read performance and reduce complex joins. It's used in data warehousing, reporting systems, and read-heavy applications where query speed is more important than write efficiency. Common techniques include adding redundant columns, creating summary tables, and using materialized views. Denormalization increases storage requirements and update complexity but significantly improves query performance. Balance it against maintenance overhead and data consistency requirements.
๐ Advanced SQL Queries
9. What are aggregate functions in SQL?
Aggregate functions perform calculations on multiple rows and return a single value. COUNT() returns number of rows, SUM() calculates total, AVG() computes average, MIN() finds minimum value, MAX() finds maximum value. They're used with GROUP BY to calculate values for each group. Aggregate functions ignore NULL values (except COUNT(*)). They can be combined with HAVING to filter groups. Common use cases include calculating totals, averages, and statistical summaries for reporting and analytics purposes.
10. Explain GROUP BY and ORDER BY clauses.
GROUP BY organizes rows sharing common values into summary rows, typically used with aggregate functions to perform calculations per group. ORDER BY sorts the result set by one or more columns in ascending (ASC) or descending (DESC) order. GROUP BY must appear before ORDER BY in query execution order. You can group by multiple columns to create hierarchical groupings. ORDER BY executes last and can sort by columns not in SELECT. Both clauses are essential for organizing and analyzing data in meaningful ways.
11. What is a subquery and what are its types?
A subquery is a query nested inside another query, enclosed in parentheses. Single-row subqueries return one row and are used with comparison operators. Multiple-row subqueries return multiple rows and use IN, ANY, or ALL operators. Correlated subqueries reference columns from the outer query and execute once per outer row. Scalar subqueries return single values. Inline views (derived tables) appear in FROM clause. Subqueries provide flexibility but can impact performance; consider JOINs or CTEs as alternatives for better optimization.
12. What is the difference between UNION and UNION ALL?
UNION combines result sets from multiple SELECT statements, removing duplicate rows and sorting results, which adds processing overhead. UNION ALL combines results and keeps all rows including duplicates, offering better performance. Both require same number of columns with compatible data types in the same order. UNION is slower because it performs distinct operation and sort. Use UNION when duplicates must be eliminated, UNION ALL when duplicates are acceptable or known not to exist. UNION ALL is preferred for better performance when duplicates don't matter.
13. What are indexes and their types?
Indexes are database structures that speed up data retrieval by creating pointers to data locations. Clustered index determines the physical storage order of data (one per table), and the table is sorted by this index. Non-clustered index creates a separate structure with pointers to actual data rows (multiple per table). Unique index ensures column uniqueness. Composite index covers multiple columns. Full-text index supports text searching. Covering index includes all columns needed by a query. Indexes improve SELECT performance but slow INSERT, UPDATE, DELETE operations.
14. What is a view in SQL?
A view is a virtual table based on a SELECT query that doesn't store data itself but presents data from underlying tables dynamically. Views simplify complex queries, enhance security by restricting column/row access, provide data abstraction, and enable backward compatibility when schemas change. They can be used like regular tables in queries. Updatable views allow INSERT, UPDATE, DELETE under certain conditions. Materialized views (in some databases) cache results for better performance. Views are ideal for frequently used complex joins or for presenting different data perspectives.
15. What are stored procedures and their advantages?
Stored procedures are precompiled SQL code blocks stored in the database, executed with a single call. They improve performance through query plan caching and reduced network traffic. They enhance security by controlling data access and preventing SQL injection. They enable code reusability, centralized business logic, and easier maintenance. Stored procedures support parameters, variables, control flow (IF, WHILE), error handling, and transactions. They're ideal for complex operations, batch processing, and enforcing business rules. However, they can make code less portable across database systems.
16. What is the difference between stored procedures and functions?
Stored procedures can return zero or multiple values using output parameters, modify database state, and cannot be used in SELECT statements. Functions must return a single value (scalar or table), should not modify database state (in most databases), and can be used in SELECT and WHERE clauses. Procedures are called using EXEC/EXECUTE, while functions are invoked in expressions. Procedures support transactions and exception handling more robustly. Use procedures for operations that modify data, functions for calculations and data retrieval that can be embedded in queries.
17. What are triggers in SQL?
Triggers are special stored procedures that automatically execute in response to specific database events (INSERT, UPDATE, DELETE). DML triggers fire on data changes, DDL triggers on schema changes, and logon triggers on user sessions. BEFORE triggers execute before the operation (validation), AFTER triggers execute after (auditing, cascading). They access OLD and NEW row values to compare changes. Triggers enforce business rules, maintain audit trails, enforce referential integrity, and perform cascading operations. Use them cautiously as they add overhead and can cause unexpected behavior.
๐พ Performance & Optimization
18. How do you optimize SQL queries?
Analyze execution plans to identify bottlenecks like table scans or missing indexes. Create appropriate indexes on columns in WHERE, JOIN, ORDER BY clauses. Use SELECT with specific columns instead of SELECT *. Replace subqueries with JOINs when possible. Avoid functions on indexed columns in WHERE clauses. Use EXISTS instead of IN for large subqueries. Limit result sets with WHERE conditions and LIMIT/TOP. Update database statistics regularly. Consider query hints, partitioning, and denormalization for complex queries. Profile both query execution time and resource consumption.
19. What is query execution plan?
An execution plan shows how the database engine processes a query, including operations like table scans, index seeks, joins, and sort operations. It displays estimated costs, row counts, and execution order. Use EXPLAIN (MySQL/PostgreSQL) or Show Execution Plan (SQL Server) to view plans. Execution plans identify performance bottlenecks like missing indexes, inefficient joins, or table scans. They help compare alternative query approaches and validate optimization efforts. Understanding execution plans is crucial for diagnosing and improving query performance in production systems.
20. What is the difference between clustered and non-clustered index?
A clustered index determines the physical storage order of table data, with leaf nodes containing actual data rowsโonly one per table, usually on the primary key. A non-clustered index creates a separate structure with index keys and pointers to data rows, allowing multiple per table. Clustered indexes are faster for range queries and sorting. Non-clustered indexes are better for specific lookups. Tables without clustered indexes are heaps. Clustered index choice significantly impacts performance, so select carefully based on query patterns and data access requirements.
21. What is database transaction and ACID properties?
A transaction is a unit of work that executes as a single atomic operation, ensuring data integrity. ACID properties ensure reliability: Atomicity (all or nothing execution), Consistency (data remains valid before and after), Isolation (concurrent transactions don't interfere), Durability (committed changes persist even after system failure). Transactions use BEGIN, COMMIT (save changes), and ROLLBACK (undo changes). Isolation levels (Read Uncommitted, Read Committed, Repeatable Read, Serializable) balance consistency against concurrency. Proper transaction management prevents data corruption and maintains business logic integrity.
22. What are isolation levels in SQL?
Isolation levels control how transaction changes are visible to other concurrent transactions. Read Uncommitted allows dirty reads (lowest isolation). Read Committed prevents dirty reads but allows non-repeatable reads. Repeatable Read prevents dirty and non-repeatable reads but allows phantom reads. Serializable prevents all concurrency issues but has the lowest concurrency (highest isolation). Higher isolation prevents anomalies but reduces concurrency and performance. Choose based on data consistency requirements versus performance needs. Most databases default to Read Committed for balanced consistency and performance.
23. What is deadlock and how do you handle it?
Deadlock occurs when two or more transactions wait for each other to release locks, creating a circular dependency that prevents progress. Databases detect deadlocks and abort one transaction (deadlock victim) to resolve. Prevent deadlocks by accessing tables in consistent order, keeping transactions short, using appropriate isolation levels, and avoiding user interaction during transactions. Handle with retry logic in application code. Use SET DEADLOCK_PRIORITY to control victim selection. Monitor deadlock graphs to identify patterns and optimize queries or add indexes to reduce lock contention.
24. What is database locking and its types?
Locking prevents concurrent transactions from interfering with each other by controlling access to database resources. Shared locks allow multiple reads but prevent writes. Exclusive locks prevent all other access during writes. Row-level locks provide better concurrency but more overhead. Table-level locks are simpler but reduce concurrency. Page-level locks balance both. Intent locks indicate planned locking at lower levels. Optimistic locking assumes conflicts are rare and checks before commit. Pessimistic locking acquires locks immediately. Proper locking ensures data integrity while maximizing concurrency and performance.
๐ง Advanced Concepts
25. What are Window Functions (Analytic Functions)?
Window functions perform calculations across rows related to the current row without grouping them into a single output row. They use OVER clause to define partitions and ordering. Common functions include ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), SUM() OVER, AVG() OVER. They enable running totals, moving averages, ranking within groups, and comparing rows. Window functions are more powerful than GROUP BY for analytics as they preserve individual rows while providing aggregate context. They're essential for complex analytical queries and reporting.
26. What is a CTE (Common Table Expression)?
CTE is a temporary named result set defined within a query using WITH clause, improving readability and maintainability. CTEs can be referenced multiple times in the main query, supporting recursive queries for hierarchical data like organizational charts. They're scoped to a single statement and don't persist. CTEs simplify complex queries by breaking them into logical steps. Recursive CTEs use UNION ALL to reference themselves. CTEs are often clearer than subqueries or temporary tables. They're optimized by query engines similar to inline views but offer better code organization.
27. What is the difference between RANK(), DENSE_RANK(), and ROW_NUMBER()?
ROW_NUMBER() assigns unique sequential numbers to rows, even with ties (1,2,3,4). RANK() assigns same rank to tied values and skips subsequent ranks (1,2,2,4). DENSE_RANK() assigns same rank to ties without gaps (1,2,2,3). All are window functions used with OVER clause for ranking within partitions. Use ROW_NUMBER() for unique identifiers, RANK() for standard competition ranking with gaps, DENSE_RANK() for ranking without gaps. They're essential for top-N queries, pagination, and competitive rankings in analytical queries.
28. What is the difference between CHAR, VARCHAR, and TEXT data types?
CHAR is fixed-length, padding with spaces to specified length, offering faster performance for fixed-size data but wasting space for variable content. VARCHAR is variable-length up to specified maximum, storing only actual characters plus length overhead, ideal for variable-length strings. TEXT stores large variable-length strings without specified maximum (implementation varies by database). CHAR is best for codes or fixed formats like state abbreviations. VARCHAR suits names, emails, and most text fields. TEXT is for large content like descriptions or documents. Choice affects storage, performance, and indexing capabilities.
29. What is SQL injection and how do you prevent it?
SQL injection is a security vulnerability where attackers insert malicious SQL code through application inputs to manipulate database queries. Prevent it by using parameterized queries (prepared statements) which separate SQL code from data. Never concatenate user input directly into SQL strings. Use stored procedures with parameters. Implement input validation and sanitization. Apply principle of least privilege for database accounts. Use ORM frameworks that handle parameterization. Enable database firewalls and monitor for suspicious queries. Regular security audits and code reviews help identify vulnerabilities before exploitation.
30. What is the difference between database and schema?
A database is the highest-level container storing all data, users, and objects for an application or system. A schema is a logical collection of database objects (tables, views, procedures) within a database, organizing objects by ownership or functionality. In SQL Server, schema is a namespace separate from users. In MySQL, schema and database are synonymous. In PostgreSQL and Oracle, schemas organize objects within databases. Schemas enable multiple teams to work in the same database without naming conflicts, support security boundaries, and help organize complex databases logically.