Backend Developer Interview Questions
31 questions — 7 easy · 17 medium · 7 hard
Programming Fundamentals (5)
SOLID:
- S — Single Responsibility: A class should have only one reason to change
- O — Open/Closed: Open for extension, closed for modification
- L — Liskov Substitution: Subtypes must be substitutable for their base types
- I — Interface Segregation: Many specific interfaces are better than one general-purpose interface
- D — Dependency Inversion: Depend on abstractions, not concretions
SRP example:
# Bad — one class does too much
class UserService:
    def create_user(self, data): ...
    def send_welcome_email(self, user): ...
    def generate_report(self, user): ...

# Good — each class has one responsibility
class UserService:
    def create_user(self, data): ...

class EmailService:
    def send_welcome_email(self, user): ...

class ReportService:
    def generate_report(self, user): ...
SRP makes code easier to test, maintain, and reason about. Each class changes only when its specific responsibility changes.
Follow-up: Which SOLID principle do you find most difficult to apply in practice?
The testing pyramid organizes tests by speed, cost, and scope:
      /  E2E  \      Few, slow, expensive
     / Integ.  \     Medium amount
    /   Unit    \    Many, fast, cheap
Unit tests (base):
- Test individual functions/classes in isolation
- Fast, deterministic, many of them
- Mock external dependencies
- Example: test a calculateTotal() function
Integration tests (middle):
- Test how components work together
- May involve real database, API calls between services
- Slower than unit tests, fewer of them
- Example: test an API endpoint with a real database
End-to-end tests (top):
- Test complete user workflows through the entire system
- Slowest, most brittle, fewest of them
- Example: test the full signup flow in a browser
Mocking vs stubbing vs faking:
- Mock — records calls and verifies interactions ("was this method called with these args?")
- Stub — returns predefined responses ("when called, return this value")
- Fake — a working simplified implementation (e.g., in-memory database instead of real one)
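The three kinds of test doubles can be sketched with Python's unittest.mock plus a hand-rolled fake (the gateway, mailer, and repository names here are illustrative, not from any specific codebase):

```python
from unittest.mock import Mock

# Stub — returns a canned value; we don't care how it was called
stub_gateway = Mock()
stub_gateway.charge.return_value = {"status": "ok"}
assert stub_gateway.charge(100)["status"] == "ok"

# Mock — records calls so we can verify the interaction afterwards
mock_mailer = Mock()
mock_mailer.send_welcome_email("alice@example.com")
mock_mailer.send_welcome_email.assert_called_once_with("alice@example.com")

# Fake — a real but simplified implementation (in-memory "database")
class FakeUserRepo:
    def __init__(self):
        self._rows = {}

    def save(self, user_id, data):
        self._rows[user_id] = data

    def find(self, user_id):
        return self._rows.get(user_id)

repo = FakeUserRepo()
repo.save(1, {"name": "Alice"})
assert repo.find(1) == {"name": "Alice"}
```

Note that in unittest.mock a single Mock object can act as either a stub or a mock; the distinction is in how the test uses it.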
Follow-up: What is the difference between mocking, stubbing, and faking?
Common strategies:
Cache-Aside (Lazy Loading):
- Application checks cache first, on miss reads from DB and populates cache
- Best for read-heavy workloads
- Risk: cache miss penalty, stale data
Write-Through:
- Write to cache and DB simultaneously
- Data is always consistent
- Slower writes, good for read-heavy + consistency needs
Write-Behind (Write-Back):
- Write to cache, asynchronously write to DB
- Fast writes, risk of data loss if cache fails
TTL (Time-To-Live):
- Data expires after a set time
- Simple, good for data that changes infrequently
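Cache-aside combined with a TTL is straightforward to sketch in Python; the dict cache and fetch_user_from_db function below are stand-ins for Redis and a real database query:

```python
import time

cache = {}  # stands in for Redis: key -> (value, expires_at)
TTL_SECONDS = 300

def fetch_user_from_db(user_id):
    # placeholder for a real SELECT against the database
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    entry = cache.get(user_id)
    if entry and entry[1] > time.time():
        return entry[0]                                  # cache hit
    user = fetch_user_from_db(user_id)                   # cache miss: read DB
    cache[user_id] = (user, time.time() + TTL_SECONDS)   # populate cache
    return user
```

The first call pays the miss penalty and populates the cache; subsequent calls within the TTL are served from memory.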
Cache levels:
- Application-level: In-memory (HashMap, LRU cache)
- Distributed: Redis, Memcached
- HTTP: Browser cache, CDN,
Cache-Controlheaders - Database: Query cache, materialized views
Cache invalidation approaches:
- TTL expiration (simplest)
- Event-driven invalidation (publish on write, subscribers clear cache)
- Version keys (change key when data changes)
Phil Karlton: "There are only two hard things in computer science: cache invalidation and naming things."
Follow-up: How do you handle cache invalidation?
Error handling best practices:
- Use a global error handler — catch unhandled errors in one place
- Distinguish error types:
- Operational errors (expected: invalid input, not found) — handle gracefully
- Programmer errors (bugs: null reference, type errors) — log and restart
- Return consistent error responses:
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Email is required",
    "status": 400
  }
}
- Never expose internal details to clients (stack traces, DB queries, file paths)
- Use appropriate HTTP status codes (400, 401, 403, 404, 422, 500)
Logging best practices:
- Structured logging — use JSON format for machine-readable logs
- Log levels: ERROR (failures), WARN (potential issues), INFO (key events), DEBUG (development)
- Include context: request ID, user ID, timestamp, operation name
- Centralized logging: aggregate logs with ELK stack, Datadog, or similar
- Correlation IDs: trace a request across multiple services
- Never log: passwords, tokens, personal data, credit card numbers
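A minimal structured-logging helper along these lines (field names such as request_id are illustrative):

```python
import json
import sys
import time

def format_record(level, message, **context):
    """Build one JSON object per line: machine-readable, easy to ship to ELK or Datadog."""
    return json.dumps({"ts": time.time(), "level": level,
                       "message": message, **context})

def log_event(level, message, **context):
    sys.stdout.write(format_record(level, message, **context) + "\n")

log_event("INFO", "user created", request_id="req-123", user_id=42)
```

Keeping the formatting in a pure function makes the log shape easy to unit-test; a real setup would route through the logging module with a JSON formatter instead of writing to stdout directly.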
Follow-up: How do you avoid exposing internal error details to the client?
Concurrency means dealing with multiple tasks at the same time — tasks make progress by interleaving (one pauses while another runs). A single CPU can be concurrent.
Parallelism means executing multiple tasks simultaneously on different CPU cores. Requires multiple processors.
Concurrency is about structure; parallelism is about execution.
Handling concurrent requests in a backend:
Thread-based (Java, .NET):
- Each request gets a thread from a thread pool
- Thread blocks while waiting for I/O (DB, HTTP calls)
- Simple mental model, but threads are expensive in memory (~1MB each)
Event loop / async I/O (Node.js, Python asyncio):
- Single-threaded event loop handles many requests
- I/O operations are non-blocking — the loop handles other requests while waiting
- CPU-bound work blocks the loop (offload to worker threads)
// Non-blocking — event loop continues while waiting for DB
async function getUser(id) {
  const user = await db.query('SELECT * FROM users WHERE id = $1', [id]);
  return user;
}
Go goroutines:
- Lightweight coroutines (~2KB) managed by the Go runtime
- True parallelism across CPU cores
- Channels for safe communication between goroutines
Race condition — when the outcome depends on the unpredictable timing of concurrent operations:
// Race condition — two requests read balance=100, both subtract 50
// Final balance is 50 instead of 0
const balance = await getBalance(userId); // both read 100
await setBalance(userId, balance - 50); // both write 50
Prevention strategies:
- Database transactions with row locking: SELECT ... FOR UPDATE
- Optimistic locking: version column, retry on conflict
- Atomic operations: UPDATE accounts SET balance = balance - 50 WHERE balance >= 50
- Distributed locks: Redis SET NX EX for cross-service locks
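The optimistic-locking strategy can be sketched in plain Python; the in-memory accounts dict stands in for a table with a version column, and the store lock simulates the database's atomic conditional UPDATE:

```python
import threading

accounts = {"acct-1": {"balance": 100, "version": 0}}
_store_lock = threading.Lock()  # simulates the DB's atomic compare-and-set

def try_withdraw(acct_id, amount):
    """One optimistic attempt: read, compute, then write only if the version is unchanged."""
    row = dict(accounts[acct_id])                # read a snapshot (no lock held)
    new_balance = row["balance"] - amount
    if new_balance < 0:
        return False                             # insufficient funds
    with _store_lock:                            # like UPDATE ... WHERE version = :read_version
        current = accounts[acct_id]
        if current["version"] != row["version"]:
            return None                          # conflict: someone else wrote first
        accounts[acct_id] = {"balance": new_balance,
                             "version": row["version"] + 1}
        return True

def withdraw(acct_id, amount, retries=5):
    for _ in range(retries):
        result = try_withdraw(acct_id, amount)
        if result is not None:
            return result                        # committed, or rejected on funds
    return False                                 # gave up after repeated conflicts
```

Two concurrent withdrawals can both read balance=100, but only the first write commits; the second sees a version mismatch and retries against the new balance, so the lost-update race above cannot occur.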
Follow-up: What is a race condition and how do you prevent it?
Databases (7)
Main join types:
- INNER JOIN — returns only rows with matching values in both tables
- LEFT JOIN — returns all rows from left table + matching rows from right (NULL if no match)
- RIGHT JOIN — returns all rows from right table + matching rows from left
- FULL OUTER JOIN — returns all rows from both tables (NULL where no match)
- CROSS JOIN — returns the Cartesian product of both tables
-- Get all users with their orders (only users who have orders)
SELECT u.name, o.total
FROM users u
INNER JOIN orders o ON u.id = o.user_id;
-- Get all users, including those without orders
SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
NULL behavior: NULLs never match in join conditions (NULL = NULL evaluates to NULL, not true, so the row is excluded). Use IS NOT DISTINCT FROM or COALESCE if you need NULL-safe comparisons.
When to use:
- INNER JOIN — when you only want rows that exist in both tables
- LEFT JOIN — when you want all records from the primary table regardless of matches
Follow-up: What happens with NULL values in joins?
An index is a data structure (typically B-tree) that speeds up data retrieval at the cost of additional storage and slower writes.
When to add an index:
- Columns used frequently in WHERE clauses
- Columns used in JOIN conditions
- Columns used in ORDER BY or GROUP BY
- Columns with high cardinality (many unique values)
- Foreign key columns
When to avoid:
- Small tables (full scan is fast enough)
- Columns with low cardinality (e.g., boolean columns)
- Tables with heavy write operations (inserts/updates)
- Columns rarely used in queries
Types:
- B-tree — general purpose, supports range queries (<, >, BETWEEN), ordered
- Hash — exact match only (=), faster for equality lookups, no range support
- Composite — index on multiple columns, order matters (leftmost prefix rule)
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);
Trade-off: Indexes speed up reads but slow down writes because the index must be updated on every insert/update/delete.
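The leftmost-prefix rule is easy to observe with SQLite's EXPLAIN QUERY PLAN (a sketch; the exact wording of the plan output varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, created_at TEXT, total REAL)")
conn.execute("CREATE INDEX idx_orders_user_date ON orders(user_id, created_at)")

def plan(sql):
    # the last column of each EXPLAIN QUERY PLAN row describes the access path
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Filtering on the leftmost column of the composite index -> index is usable
print(plan("SELECT * FROM orders WHERE user_id = 1"))

# Filtering on the second column alone -> the index cannot be used, full scan
print(plan("SELECT * FROM orders WHERE created_at > '2024-01-01'"))
```

The first plan reports a search using idx_orders_user_date; the second falls back to scanning the table, which is exactly the leftmost-prefix rule in action.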
Follow-up: What is the difference between a B-tree index and a hash index?
SQL (Relational):
- Structured data in tables with rows and columns
- Fixed schema, enforced by the database
- ACID transactions (Atomicity, Consistency, Isolation, Durability)
- Examples: PostgreSQL, MySQL, SQLite
NoSQL (Non-relational):
- Flexible data models: document, key-value, column-family, graph
- Schema-less or flexible schema
- BASE model (Basically Available, Soft state, Eventually consistent)
- Examples: MongoDB (document), Redis (key-value), Cassandra (column), Neo4j (graph)
Choose SQL when:
- Data has clear relationships and structure
- You need complex queries with joins
- ACID compliance is critical (financial data, transactions)
- Data integrity and consistency are top priority
Choose NoSQL when:
- Schema changes frequently or is unpredictable
- Horizontal scaling is needed (distributed systems)
- High write throughput is required
- Data is hierarchical or document-like (e.g., JSON)
Polyglot persistence: Yes, many projects use both — for example, PostgreSQL for user accounts and transactions, Redis for caching and sessions, Elasticsearch for search.
Follow-up: Can you use both SQL and NoSQL in the same project?
A transaction is a sequence of database operations that are treated as a single unit. Either all operations succeed (commit) or none of them apply (rollback). Transactions ensure data integrity even in the presence of failures or concurrent access.
ACID properties:
- Atomicity — all operations in a transaction succeed or none do. No partial updates.
- Consistency — a transaction brings the database from one valid state to another. Constraints and rules are always satisfied.
- Isolation — concurrent transactions don't interfere with each other. Each transaction sees a consistent view of the data.
- Durability — once committed, the transaction persists even if the system crashes (written to disk).
Example:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
-- If either update fails, both are rolled back
Isolation levels (weakest to strongest):
| Level | Dirty Read | Non-Repeatable Read | Phantom Read |
|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | Prevented | Possible | Possible |
| Repeatable Read | Prevented | Prevented | Possible |
| Serializable | Prevented | Prevented | Prevented |
- Dirty read — reading uncommitted data from another transaction
- Non-repeatable read — same row returns different values in the same transaction
- Phantom read — a query returns different rows if run twice (another transaction inserted/deleted)
Most databases default to Read Committed. PostgreSQL's default is Read Committed; MySQL InnoDB uses Repeatable Read.
Follow-up: What is an isolation level and what problems does each level prevent?
The N+1 query problem occurs when an application fetches N records and then executes one additional query for each record, resulting in N+1 total queries instead of the expected 1 or 2.
Example — N+1 problem:
// 1 query to get all posts
const posts = await Post.findAll();
// N queries — one per post to get its author
for (const post of posts) {
  const author = await User.findById(post.authorId); // N queries!
  console.log(post.title, author.name);
}
// Total: 1 + N queries
How to detect it:
- Enable SQL query logging in development and look for repeating patterns
- Use ORM tools like Django Debug Toolbar, Bullet gem (Rails), or Hibernate statistics
- Query count spikes when list size grows
Fix 1 — JOIN query:
SELECT posts.*, users.name AS author_name
FROM posts
JOIN users ON users.id = posts.author_id;
Fix 2 — Eager loading (ORM):
// Prisma — include related data in one query
const posts = await prisma.post.findMany({
  include: { author: true },
});
// Sequelize
const posts = await Post.findAll({ include: [User] });
Fix 3 — DataLoader pattern (batching):
Used in GraphQL — collect all IDs, then fetch in a single query:
const userLoader = new DataLoader(async (ids) => {
  const users = await User.findMany({ where: { id: { in: ids } } });
  return ids.map(id => users.find(u => u.id === id));
});
The N+1 problem is one of the most common causes of poor API performance in applications using ORMs.
Follow-up: How does eager loading solve the N+1 problem?
A database migration is a version-controlled script that describes a change to the database schema (adding a table, renaming a column, adding an index) or to the data itself. Migrations allow schema evolution to be tracked in version control and applied consistently across all environments.
Why migrations matter:
- Schema changes are reproducible and reversible
- All environments (dev, staging, prod) stay in sync
- Team members share the same schema history
- Rollback is possible if something goes wrong
Migration file example (SQL):
-- migration: 20260401_add_email_verified_to_users.sql
ALTER TABLE users ADD COLUMN email_verified BOOLEAN NOT NULL DEFAULT FALSE;
CREATE INDEX idx_users_email_verified ON users(email_verified);
Tools: Flyway, Liquibase, Prisma Migrate, Alembic (Python), Rails ActiveRecord Migrations.
Safe production practices:
- Never modify migrations that have already been applied — create a new one
- Test migrations against a production-size data clone before applying
- Always have a rollback migration ready
- Use transactions for DDL operations where the database supports it
Zero-downtime migrations are needed when tables are large and the migration would lock the table for a long time (e.g., adding a non-nullable column, building an index).
Strategy — expand/contract (also called parallel change):
- Expand — add new column as nullable, deploy app that writes to both old and new column
- Backfill — populate existing rows in batches
- Contract — make column non-nullable, remove old column, deploy app that only uses new column
For index creation, use CREATE INDEX CONCURRENTLY in PostgreSQL to build without locking.
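The backfill step of expand/contract can be sketched as a batched loop (SQLite here as a stand-in; the table and column names are illustrative). Small batches keep each transaction, and therefore each lock, short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_verified INTEGER)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"u{i}@example.com",) for i in range(1000)])

BATCH = 100  # tune so each batch commits in well under a second

while True:
    cur = conn.execute(
        """UPDATE users SET email_verified = 0
           WHERE id IN (SELECT id FROM users
                        WHERE email_verified IS NULL LIMIT ?)""",
        (BATCH,),
    )
    conn.commit()
    if cur.rowcount == 0:  # nothing left to backfill
        break
```

In production you would also sleep briefly between batches and monitor replication lag, but the core idea is the same: many small writes instead of one table-locking UPDATE.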
Follow-up: What is a zero-downtime migration and when do you need one?
A deadlock occurs when two or more transactions are each waiting for a lock held by the other, creating a cycle where none can proceed.
Classic example:
Transaction A: Transaction B:
LOCK row 1 (success) LOCK row 2 (success)
Wait for lock on row 2... Wait for lock on row 1...
← DEADLOCK →
-- Transaction A
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1; -- locks row 1
UPDATE accounts SET balance = balance + 100 WHERE id = 2; -- waits for row 2
-- Transaction B (concurrent)
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 2; -- locks row 2
UPDATE accounts SET balance = balance + 50 WHERE id = 1; -- waits for row 1 → DEADLOCK
How databases resolve deadlocks: The database periodically checks for wait-for cycles. When detected, it picks a victim transaction (usually the one that has done the least work) and rolls it back, allowing the other to proceed. The application must retry the rolled-back transaction.
Prevention strategies:
- Consistent lock ordering — always acquire locks in the same order across all transactions (e.g., always lock the lower ID first)
- Keep transactions short — minimize the time locks are held
- Use lower isolation levels when strict isolation is not needed
- Optimistic concurrency — don't lock at read time, check for conflicts at write time
- Avoid user input during transactions — never hold a lock while waiting for user response
- SELECT FOR UPDATE SKIP LOCKED — skip locked rows instead of waiting, useful for job queues
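Consistent lock ordering can be demonstrated with plain threading locks; this is an application-level sketch of the same idea the database applies to rows:

```python
import threading

locks = {1: threading.Lock(), 2: threading.Lock()}
balances = {1: 100, 2: 100}

def transfer(src, dst, amount):
    # Always acquire the lower-ID lock first, regardless of transfer direction.
    # Without this ordering, opposite-direction transfers could each grab one
    # lock and wait forever on the other — the classic deadlock cycle.
    first, second = sorted((src, dst))
    with locks[first], locks[second]:
        balances[src] -= amount
        balances[dst] += amount

# Opposite-direction transfers now run safely in parallel
t1 = threading.Thread(target=transfer, args=(1, 2, 30))
t2 = threading.Thread(target=transfer, args=(2, 1, 10))
t1.start(); t2.start()
t1.join(); t2.join()
print(balances)  # {1: 80, 2: 120} — money conserved, no deadlock
```

The same discipline applies to SQL: if every transaction updates account rows in ascending ID order, the wait-for graph can never form a cycle.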
Follow-up: How does a database detect and resolve a deadlock?
API Design (3)
REST principles:
- Resources identified by URLs (/users, /users/1)
- HTTP methods map to operations: GET (read), POST (create), PUT (full update), PATCH (partial update), DELETE (remove)
- Stateless — each request contains all information needed
- Proper status codes — 200, 201, 204, 400, 401, 403, 404, 500
Good API design:
GET /api/users — list users
GET /api/users/1 — get user by ID
POST /api/users — create user
PUT /api/users/1 — replace user
PATCH /api/users/1 — update user fields
DELETE /api/users/1 — delete user
GET /api/users/1/orders — nested resources
Best practices:
- Use plural nouns for resources (/users, not /user)
- Use query parameters for filtering (/users?role=admin)
- Paginate list endpoints (?page=1&limit=20)
- Return consistent error format with message and error code
- Use HATEOAS links for discoverability (optional)
Versioning strategies: URL path (/api/v2/users), header (Accept: application/vnd.api.v2+json), or query parameter (?version=2). URL path is most common.
Follow-up: How do you handle API versioning?
Authentication (AuthN) — verifying who you are. Authorization (AuthZ) — verifying what you can do.
| Aspect | Authentication | Authorization |
|---|---|---|
| Question | Who are you? | What can you access? |
| Example | Login with password | Admin vs. regular user |
| Happens | First | After authentication |
| Methods | Password, OAuth, biometrics | Roles, permissions, ACLs |
JWT-based authentication flow:
- User sends credentials (email + password) to /auth/login
- Server validates credentials, generates a JWT containing user ID and roles
- Server returns the JWT to the client
- Client stores the JWT (typically in memory or httpOnly cookie)
- Client sends JWT in Authorization: Bearer <token> header with each request
- Server verifies the JWT signature and extracts user info
JWT structure: header.payload.signature (Base64-encoded)
- Header: algorithm and token type
- Payload: claims (user ID, roles, expiration)
- Signature: ensures the token has not been tampered with
Security considerations: Use short expiration times, refresh tokens for long sessions, httpOnly cookies to prevent XSS.
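The header.payload.signature structure can be reproduced with only the Python standard library. This is a teaching sketch of HS256 signing (it skips claim validation such as expiration; use a vetted library like PyJWT in production):

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    # JWT uses URL-safe base64 with padding stripped
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(payload: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = (b64url(json.dumps(header).encode()) + "." +
                     b64url(json.dumps(payload).encode()))
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)

def verify_jwt(token: str, secret: bytes) -> bool:
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    # constant-time comparison prevents timing attacks on the signature
    return hmac.compare_digest(b64url(expected), sig)

token = sign_jwt({"sub": 42, "role": "admin"}, b"server-secret")
assert verify_jwt(token, b"server-secret")
assert not verify_jwt(token, b"wrong-secret")
```

Note that the payload is only base64-encoded, not encrypted: anyone holding the token can read the claims, and only the signature stops tampering.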
Follow-up: Describe how JWT-based authentication works.
REST is an architectural style that uses HTTP methods and resource-oriented URLs. Each resource has a fixed response shape.
GraphQL is a query language for APIs that lets clients specify exactly what data they need in a single request.
Key differences:
| Aspect | REST | GraphQL |
|---|---|---|
| Endpoints | Multiple (/users, /posts) | Single (/graphql) |
| Response shape | Fixed by server | Defined by client |
| Over/under-fetching | Common | Solved |
| Versioning | URL or headers | Schema evolution (deprecation) |
| Type system | OpenAPI (optional) | Built-in, enforced |
| Caching | HTTP cache (CDN-friendly) | Complex (POST requests) |
| Learning curve | Low | Higher |
GraphQL example:
query {
  user(id: 42) {
    name
    email
    posts(limit: 5) {
      title
      publishedAt
    }
  }
}
This fetches exactly what is needed in one request — no over-fetching (extra fields) or under-fetching (multiple requests).
Disadvantages of GraphQL:
- Caching is hard — queries are POST requests; HTTP caching doesn't work without extra tooling (persisted queries)
- N+1 problem — without DataLoader, deeply nested queries trigger many DB queries
- Schema complexity — requires maintaining a typed schema and resolvers
- Security — clients can craft expensive queries; need query depth/complexity limits
- File uploads — not natively supported, requires workarounds
- Overkill for simple APIs — REST is simpler when you control both client and server
When to prefer GraphQL: Multiple client types (web, mobile, third-party) with different data needs; data-heavy apps with complex, nested relationships.
Follow-up: What are the main disadvantages of GraphQL compared to REST?
DevOps (3)
Docker is a platform for building, running, and shipping applications in containers — lightweight, isolated environments that package code with all its dependencies.
Container vs Virtual Machine:
| Aspect | Container | Virtual Machine |
|---|---|---|
| Isolation | Process-level (shared kernel) | Full OS (own kernel) |
| Size | Megabytes | Gigabytes |
| Startup | Seconds | Minutes |
| Overhead | Minimal | Significant |
| Portability | Very high | High |
Image vs Container:
- Image — a read-only template (blueprint) with the app, dependencies, and config
- Container — a running instance of an image (like an object from a class)
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]
Key concepts:
- Dockerfile — instructions to build an image
- docker-compose — define and run multi-container applications
- Layers — each instruction creates a cached layer (speeds up rebuilds)
- Volumes — persist data outside the container lifecycle
Follow-up: Explain the difference between a Docker image and a container.
CI (Continuous Integration) is the practice of frequently merging code changes into a shared branch and automatically running tests and quality checks on each merge.
CD (Continuous Delivery/Deployment) extends CI by automatically preparing and optionally deploying the application to production after tests pass.
Typical backend CI/CD pipeline:
Code Push → Pull Request
↓
[CI Pipeline]
1. Install dependencies
2. Lint & type check
3. Unit tests
4. Integration tests
5. Build Docker image
6. Security scan (Trivy, Snyk)
↓
PR Approved → Merge to main
↓
[CD Pipeline]
7. Build & tag production image
8. Deploy to staging
9. Run E2E tests against staging
10. Deploy to production (canary → full)
11. Run smoke tests
12. Notify team
Continuous Delivery vs Continuous Deployment:
- Continuous Delivery — every commit is releasable, but a human approves the production deployment. The pipeline automates everything up to staging.
- Continuous Deployment — every commit that passes all tests is automatically deployed to production. No human gate.
Best practices:
- Keep pipelines fast (< 10 minutes) — developers wait for feedback
- Fail fast — run the cheapest checks first
- Deploy the same artifact across all environments (build once, deploy many)
- Use feature flags to decouple deployment from feature release
- Maintain deployment rollback capability
Popular tools: GitHub Actions, GitLab CI, CircleCI, Jenkins, ArgoCD (GitOps).
Follow-up: What is the difference between continuous delivery and continuous deployment?
Vertical scaling (scale up) means adding more resources (CPU, RAM, disk) to an existing server.
Horizontal scaling (scale out) means adding more server instances and distributing load across them using a load balancer.
Comparison:
| Aspect | Vertical | Horizontal |
|---|---|---|
| How | Bigger machine | More machines |
| Cost | Exponentially expensive at high end | Linear cost |
| Downtime | Often requires restart | Zero-downtime with rolling deploys |
| Limit | Hardware maximum | Theoretically unlimited |
| Complexity | Simple | Requires stateless design, load balancing |
| Single point of failure | Yes | No |
When to choose vertical:
- Stateful applications that are hard to distribute (legacy monoliths)
- Database servers (easier to manage, ACID on one node)
- Quick fix for immediate capacity needs
- Cost is still roughly linear at lower tiers (cloud instance upgrades)
When to choose horizontal:
- Stateless services (web servers, APIs) — trivial to distribute
- Need fault tolerance (one instance failure doesn't affect availability)
- Traffic patterns have large peaks (auto-scale up and back down)
- Cost optimization at scale
Challenges of horizontal scaling:
- Session management — sessions can't live in server memory; use Redis or JWTs
- Distributed state — caches, locks, counters must be shared (Redis, ZooKeeper)
- Data consistency — multiple writers to a distributed database
- Sticky sessions — some protocols require the same server for a session
- Network overhead — inter-service communication adds latency
- Observability — logs and metrics from many instances need aggregation
Follow-up: What challenges does horizontal scaling introduce?
Security (4)
SQL injection is an attack where malicious SQL code is inserted into an input field and executed by the database. It is consistently in the OWASP Top 10 most critical web application vulnerabilities.
Example of vulnerable code:
# DANGEROUS — never do this
query = f"SELECT * FROM users WHERE email = '{user_input}'"
# Attacker input: ' OR '1'='1
# Resulting query: SELECT * FROM users WHERE email = '' OR '1'='1'
# Returns all rows!
Prevention — use parameterized queries:
# SAFE — parameterized query
cursor.execute("SELECT * FROM users WHERE email = ?", (user_input,))
# Node.js / PostgreSQL
const result = await pool.query(
  'SELECT * FROM users WHERE email = $1',
  [userInput]
);
Additional defenses:
- ORM/query builders — most ORMs (Prisma, SQLAlchemy, Hibernate) use parameterized queries by default
- Input validation — validate and sanitize all user input before processing
- Least privilege — database users should have only the permissions they need
- Web Application Firewall (WAF) — as a last line of defense, not a primary control
- Stored procedures — can help but are not foolproof if they concatenate strings internally
Prepared statements and parameterized queries are equivalent: both separate SQL code from data so user input is always treated as a value, never executable code.
Follow-up: What is the difference between prepared statements and parameterized queries?
CORS (Cross-Origin Resource Sharing) is a browser security mechanism that restricts HTTP requests made from a different origin (protocol + domain + port) than the server's origin. It prevents malicious websites from making authenticated requests to your API using the visitor's credentials.
How it works:
- Browser sends an HTTP request with an
Originheader - Server responds with
Access-Control-Allow-Originheader - Browser allows or blocks the response based on the header
Configuring CORS in Node.js (Express):
import cors from 'cors';
app.use(cors({
  origin: ['https://myapp.com', 'https://staging.myapp.com'],
  methods: ['GET', 'POST', 'PUT', 'DELETE', 'PATCH'],
  allowedHeaders: ['Content-Type', 'Authorization'],
  credentials: true,
  maxAge: 86400,
}));
Common mistakes:
- Using origin: '*' with credentials: true — browsers block this combination
- Allowing all origins in production — exposes the API to any website
- Forgetting to allow custom headers like Authorization
Preflight requests: The browser automatically sends an OPTIONS request before cross-origin requests that use non-simple methods (PUT, DELETE, PATCH) or custom headers. The server must respond with appropriate CORS headers for the actual request to proceed. Set maxAge to cache the preflight response and reduce OPTIONS requests.
Follow-up: What is a preflight request and when does the browser send one?
The OWASP Top 10 is a standard list of the most critical web application security risks, updated periodically by the Open Web Application Security Project.
Key vulnerabilities:
1. Broken Access Control (A01)
Users can access resources or perform actions beyond their permissions. Example: a regular user accessing /admin/users or modifying another user's data by changing the user_id in a request.
Prevention: enforce authorization on every endpoint, validate ownership of resources server-side, deny by default.
2. Cryptographic Failures (A02)
Sensitive data (passwords, credit cards, health data) exposed due to weak encryption or no encryption. Example: storing passwords in plaintext or using MD5 for password hashing.
Prevention: hash passwords with bcrypt/Argon2, use TLS everywhere, never store sensitive data you don't need.
3. Injection (A03)
Untrusted data sent to an interpreter as part of a command. SQL injection is the classic example; others include command injection, LDAP injection, and NoSQL injection.
Prevention: parameterized queries, input validation, allowlists.
4. Security Misconfiguration (A05)
Default credentials, unnecessary features enabled, verbose error messages, missing security headers.
Prevention: security hardening checklists, infrastructure-as-code, regular audits.
5. Identification and Authentication Failures (A07)
Weak passwords allowed, no rate limiting on login, sessions not invalidated on logout.
Prevention: MFA, rate limiting, secure session management, strong password policies.
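The broken-access-control prevention above (validate ownership server-side, deny by default) can be sketched as a small framework-agnostic guard; the ORDERS dict and role names are illustrative:

```python
class Forbidden(Exception):
    """Raised when the current user may not access the resource."""

ORDERS = {7: {"id": 7, "owner_id": 1, "total": 99.0}}  # stand-in for a DB table

def get_order(order_id: int, current_user: dict) -> dict:
    order = ORDERS.get(order_id)
    if order is None:
        # Deny by default; whether to return 404 or 403 here is a policy
        # choice, but either way don't leak which IDs exist.
        raise Forbidden()
    # Server-side ownership check: never trust an owner_id sent by the client
    if order["owner_id"] != current_user["id"] and current_user.get("role") != "admin":
        raise Forbidden()
    return order
```

Changing the user_id in the request no longer helps an attacker: the ownership comparison happens against the authenticated identity, not request data.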
Follow-up: How would you protect against broken access control?
Never store passwords in plaintext. If your database is compromised, plaintext passwords expose users on every service where they reused that password.
Correct approach — use a password hashing algorithm:
Password hashing algorithms are specifically designed to be slow (computationally expensive), making brute-force attacks impractical.
Recommended algorithms (in order of preference):
- Argon2id — winner of the Password Hashing Competition (2015), best current choice
- bcrypt — widely supported, battle-tested since 1999
- scrypt — memory-hard, good alternative to bcrypt
Implementation with bcrypt (Node.js):
import bcrypt from 'bcrypt';
const SALT_ROUNDS = 12;
async function hashPassword(plaintext) {
  return bcrypt.hash(plaintext, SALT_ROUNDS);
}

async function verifyPassword(plaintext, hash) {
  return bcrypt.compare(plaintext, hash);
}
// Usage
const hash = await hashPassword('myP@ssw0rd');
await verifyPassword('myP@ssw0rd', hash); // true
await verifyPassword('wrongpassword', hash); // false
Why MD5/SHA-256 are unsuitable for passwords:
- They are designed to be fast — a GPU can compute billions of SHA-256 hashes per second
- No built-in salting — identical passwords produce identical hashes, enabling rainbow table attacks
- bcrypt with cost factor 12 takes ~250ms; SHA-256 takes nanoseconds
Additional best practices:
- Use a unique random salt per password (bcrypt/Argon2 do this automatically)
- Set a minimum password length (12+ characters)
- Check passwords against breach databases (HaveIBeenPwned API)
- Never log passwords, even hashed ones
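Python's standard library ships hashlib.scrypt, one of the recommended memory-hard algorithms, so a salted-hash sketch needs no third-party code (the cost parameters shown are common defaults; check current guidance before fixing them in production):

```python
import hashlib
import hmac
import os

def hash_password(plaintext: str) -> bytes:
    salt = os.urandom(16)  # unique random salt per password
    digest = hashlib.scrypt(plaintext.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt + digest   # store the salt alongside the hash

def verify_password(plaintext: str, stored: bytes) -> bool:
    salt, digest = stored[:16], stored[16:]
    candidate = hashlib.scrypt(plaintext.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, digest)  # constant-time comparison

stored = hash_password("myP@ssw0rd")
assert verify_password("myP@ssw0rd", stored)
assert not verify_password("wrongpassword", stored)
```

With bcrypt or Argon2 libraries the salt handling is done for you; the manual salt-prefix scheme here just makes the mechanism visible.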
Follow-up: Why is MD5 or SHA-256 not suitable for password hashing?
Caching (3)
Redis is an in-memory data structure store that can be used as a database, cache, and message broker. Because all data lives in RAM, read and write operations are typically completed in under a millisecond.
Core data structures:
- String — simple key-value, counters (INCR, DECR)
- Hash — field-value pairs within a key (user objects)
- List — ordered sequences, stacks and queues
- Set — unique unordered members (tags, unique visitors)
- Sorted Set — members with scores, leaderboards
- Stream — append-only log for event sourcing
Common use cases:
# Session storage
SET session:abc123 '{"userId":1,"role":"admin"}' EX 3600
# Rate limiting (sliding window counter)
INCR rate:user:42
EXPIRE rate:user:42 60
# Leaderboard
ZADD leaderboard 1500 "player:42"
ZRANGE leaderboard 0 9 REV WITHSCORES
# Pub/sub for real-time notifications
PUBLISH notifications '{"type":"order_shipped","orderId":99}'
Persistence modes:
- RDB (Redis Database) — periodic snapshots to disk. Compact, fast restarts, but may lose data since last snapshot.
- AOF (Append-Only File) — logs every write operation. More durable (can configure fsync per second or always), larger files, slower restarts.
- No persistence — pure cache mode, fastest, all data lost on restart.
For production caches choose RDB or no persistence. For durable storage combine both modes.
Follow-up: What is the difference between Redis persistence modes RDB and AOF?
Rate limiting controls how many requests a client can make in a given time window. It protects the API from abuse, brute-force attacks, and accidental overload.
Common algorithms:
Fixed Window:
- Count requests in fixed time buckets (e.g., 100 requests per minute)
- Simple and memory-efficient
- Weakness: burst traffic at window boundaries (200 requests in 2 seconds spanning two windows)
Sliding Window:
- Count requests in a rolling time window from each request's perspective
- More accurate, prevents boundary bursts
- Slightly more complex to implement
Token Bucket:
- Tokens are added at a constant rate, each request consumes a token
- Allows short bursts up to bucket capacity
- Good for APIs that want to allow occasional bursts
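The token bucket algorithm can be sketched in a few lines of plain Python. The `TokenBucket` class below is illustrative, not from any library:

```python
import time

class TokenBucket:
    """Toy token bucket: refills at `rate` tokens/sec, up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
# A burst of 10 requests drains the bucket; the 11th is rejected until refill
results = [bucket.allow() for _ in range(11)]
```

Note how `capacity` bounds the burst while `rate` bounds the sustained throughput — the two knobs the bullet points above describe.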
Implementation with Redis (fixed window counter):
async function isRateLimited(userId, limit = 100, windowSec = 60) {
const key = `rate:${userId}:${Math.floor(Date.now() / (windowSec * 1000))}`;
const count = await redis.incr(key);
if (count === 1) {
await redis.expire(key, windowSec);
}
return count > limit;
}
Response headers to include:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1700000060
Retry-After: 30 (when 429 is returned)
Return HTTP 429 Too Many Requests when the limit is exceeded. Rate limit by IP for unauthenticated endpoints and by user ID for authenticated ones.
Follow-up: What is the difference between the fixed window and sliding window algorithms?
A CDN (Content Delivery Network) is a geographically distributed network of servers (edge nodes) that cache and serve content from the location closest to the user, reducing latency and offloading traffic from the origin server.
How it helps:
- Reduced latency — user fetches assets from a nearby edge node instead of a distant origin server
- Reduced origin load — cached responses are served by the CDN, not your server
- DDoS protection — CDNs can absorb large traffic spikes and filter malicious traffic
- Automatic compression — Gzip/Brotli compression without origin server work
- TLS termination — CDN handles HTTPS handshakes, reducing origin CPU load
What to cache on a CDN:
- Static assets: images, CSS, JavaScript bundles, fonts
- Publicly accessible API responses that change infrequently
- Server-rendered HTML pages with appropriate Cache-Control headers
Cache control headers:
Cache-Control: public, max-age=31536000, immutable # Static assets with hash
Cache-Control: public, max-age=300, stale-while-revalidate=60 # API responses
Cache-Control: no-store # Sensitive/private data
NOT suitable for CDN caching:
- Authenticated or personalized content (user dashboards, account pages)
- Real-time data that must be fresh (live stock prices, chat messages)
- POST/PUT/DELETE requests — CDNs only cache GET/HEAD by default
- Responses with Set-Cookie headers — the CDN may strip cookies
Popular CDNs include Cloudflare, AWS CloudFront, Fastly, and Akamai.
Follow-up: What types of content are NOT suitable for CDN caching?
Messaging (3)
A message queue is a form of asynchronous communication between services where a producer sends a message to a queue and a consumer reads and processes it independently. The producer does not wait for the consumer to finish.
When to use a message queue instead of a direct API call:
- Decoupling — producer and consumer don't need to be running at the same time
- Absorbing traffic spikes — queue buffers bursts so consumers process at a steady rate
- Reliability — if the consumer is down, messages wait in the queue rather than being lost
- Long-running tasks — image processing, email sending, PDF generation — don't block HTTP responses
- Fan-out — one event needs to trigger multiple independent consumers (order placed → send email + update inventory + notify warehouse)
Real-world example:
User places order → API returns 200 immediately
↓
[Order Queue]
↓
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Email Worker │ │Inventory Svc │ │Warehouse Svc │
└──────────────┘ └──────────────┘ └──────────────┘
Delivery guarantees:
- At-most-once — message delivered 0 or 1 times, possible loss (fire and forget)
- At-least-once — message delivered 1 or more times, possible duplicates (most queues)
- Exactly-once — delivered exactly once, hardest to achieve, requires idempotency
Popular tools: RabbitMQ, AWS SQS, Apache Kafka, Google Pub/Sub, Redis Streams.
Always design consumers to be idempotent — processing the same message twice should produce the same result as processing it once.
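Idempotency can be as simple as tracking processed message IDs. A minimal in-memory sketch — a real consumer would persist the seen-ID set in a database or Redis:

```python
processed_ids = set()        # in production: a DB table or Redis set
inventory = {"sku-1": 10}    # toy state mutated by the consumer

def handle_order_shipped(message):
    """Safe under at-least-once delivery: duplicate messages are no-ops."""
    if message["id"] in processed_ids:
        return "skipped"      # duplicate delivery — already handled
    inventory[message["sku"]] -= message["qty"]
    processed_ids.add(message["id"])
    return "processed"

msg = {"id": "m-1", "sku": "sku-1", "qty": 2}
first = handle_order_shipped(msg)    # decrements stock
second = handle_order_shipped(msg)   # duplicate: no double decrement
```

Processing the same message twice leaves the inventory exactly as if it had been processed once — which is the idempotency guarantee the consumer needs.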
Follow-up: What guarantees does a message queue provide around message delivery?
Apache Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant, real-time data pipelines. It is fundamentally different from traditional message queues like RabbitMQ.
Key differences:
| Aspect | Traditional Queue (RabbitMQ) | Kafka |
|---|---|---|
| Model | Push to consumer | Pull by consumer |
| Message retention | Deleted after consumption | Retained for configurable period |
| Ordering | Per-queue | Per-partition |
| Replay | Not supported | Supported (seek to any offset) |
| Throughput | Millions/day | Millions/second |
| Consumer | Competes for messages | Each consumer group gets all messages |
Core concepts:
- Topic — a named stream of messages (like a table in a DB)
- Partition — topics are split into partitions for parallelism. Messages within a partition are ordered.
- Offset — sequential ID for each message in a partition. Consumers track their position.
- Producer — writes messages to topics
- Consumer — reads messages from topics
- Broker — a Kafka server. A cluster has multiple brokers for redundancy.
Consumer groups: A consumer group is a set of consumers that cooperate to consume a topic. Kafka distributes partitions across consumers in the group — each partition is consumed by exactly one consumer in the group at a time. Multiple groups can read the same topic independently, enabling fan-out.
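The partition-to-consumer mapping can be illustrated with a toy round-robin assignor. This shows the idea only — Kafka's actual assignment strategies (range, round-robin, sticky) live in the client library:

```python
def assign_partitions(partitions, consumers):
    """Round-robin: each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

# A topic with 6 partitions consumed by a group of 2
groups = assign_partitions(list(range(6)), ["consumer-a", "consumer-b"])
```

With more consumers than partitions, the extra consumers sit idle — which is why partition count caps a group's parallelism.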
When to choose Kafka over a queue:
- You need event replay or audit logs
- Multiple independent systems need the same events
- Very high throughput (millions of events per second)
- Event sourcing or CQRS architecture
Follow-up: What is a consumer group in Kafka?
Synchronous communication means the caller waits for the response before continuing. HTTP REST and gRPC calls are typical examples. The caller is blocked until the callee responds.
Asynchronous communication means the caller sends a message and continues without waiting. The response (if any) arrives later. Message queues and event buses are common mechanisms.
Comparison:
| Aspect | Synchronous | Asynchronous |
|---|---|---|
| Coupling | Tight (caller needs callee available) | Loose (independent availability) |
| Latency | Response time = callee processing time | Near-instant acknowledgment |
| Complexity | Simpler mental model | More complex (ordering, retries, idempotency) |
| Failure handling | Easy — callee returns an error | Harder — dead letter queues, retries |
| Use case | Queries, real-time results needed | Background work, fan-out, high throughput |
Event-driven architecture trade-offs:
Advantages:
- Services are decoupled and independently deployable
- Natural scalability — add more consumers to handle load
- Resilience — services can tolerate each other's downtime
- Easy audit trail if events are stored durably
Disadvantages:
- Eventual consistency — data may be temporarily inconsistent across services
- Debugging is harder — tracing a request across multiple async steps requires distributed tracing
- Ordering guarantees are complex in distributed systems
- At-least-once delivery requires idempotent consumers
- Increased infrastructure complexity (queue management, dead letter handling)
Follow-up: What are the trade-offs of event-driven architecture?
Monitoring (3)
Monitoring tells you whether a system is working — it answers predefined questions using dashboards and alerts. You know what to look for ahead of time.
Observability is the ability to understand the internal state of a system from its external outputs — it answers unknown questions. A system is observable if you can diagnose new problems without deploying new instrumentation.
The three pillars of observability:
1. Metrics — numerical measurements over time. Aggregated, low-cardinality data ideal for dashboards and alerting.
http_request_duration_seconds{method="GET", route="/users", status="200"} 0.045
error_rate 0.02
queue_depth 142
Tools: Prometheus, Datadog, CloudWatch
2. Logs — immutable, timestamped records of events. High detail but high volume. Best for debugging specific requests.
{"level":"error","time":"2026-04-01T10:00:00Z","requestId":"abc-123","userId":42,"msg":"Payment failed","reason":"card_declined"}
Tools: ELK Stack, Loki, Datadog Logs
3. Traces — end-to-end records of a request's journey across services. Each step is a span with timing, metadata, and parent-child relationships.
[request abc-123]
├── API Gateway 5ms
├── Auth Service 12ms
├── Order Service 38ms
│ ├── DB query 25ms
│ └── Redis get 3ms
└── Payment Service 95ms ← bottleneck
Tools: Jaeger, Zipkin, Datadog APM, OpenTelemetry
The key enabling standard is OpenTelemetry — a vendor-neutral SDK for emitting all three signals from your application code.
Follow-up: What are the three pillars of observability?
SLO (Service Level Objective) is a target value for a service reliability metric. It defines what "good enough" means for your service from the user's perspective.
Related terms:
- SLI (Service Level Indicator) — the actual measurement (e.g., % of requests under 200ms)
- SLO — the target for that measurement (e.g., 99.5% of requests under 200ms)
- SLA (Service Level Agreement) — a contractual commitment to customers with consequences for violation
Common SLIs and SLOs:
| SLI | Example SLO |
|---|---|
| Availability | 99.9% of requests return non-5xx responses |
| Latency | 95% of requests complete in < 200ms, 99% < 1s |
| Error rate | < 0.1% of requests result in errors |
| Throughput | Process > 1000 events per second |
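Computing a latency SLI from raw request durations is simple arithmetic. A sketch with synthetic numbers — `latency_sli` is a hypothetical helper, not a standard API:

```python
def latency_sli(durations_ms, threshold_ms):
    """Fraction of requests completing under the latency threshold."""
    fast = sum(1 for d in durations_ms if d < threshold_ms)
    return fast / len(durations_ms)

# 100 synthetic requests: 95 fast, 5 slow
durations = [50] * 95 + [500] * 5
sli = latency_sli(durations, threshold_ms=200)  # 0.95 — meets a 95% SLO
```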
Defining a good SLO:
- Start from user experience — what degradation do users actually notice?
- Look at historical data — what have you actually achieved?
- Set a target slightly below your historical best to leave headroom
- Measure at the user-facing boundary, not internal components
Error budget: If your SLO is 99.9% availability, you have 0.1% allowed downtime — about 43 minutes per month. This is your error budget.
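The 43-minute figure follows from simple arithmetic; a small helper (a month is approximated as 30 days):

```python
def error_budget_minutes(slo_percent, days=30):
    """Allowed downtime per period for a given availability SLO."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - slo_percent / 100)

budget = error_budget_minutes(99.9)  # roughly 43 minutes per month
```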
Teams use error budgets to balance reliability and velocity:
- Budget remaining → deploy freely, experiment, take risks
- Budget nearly exhausted → freeze deployments, focus on reliability
- Budget exhausted → postmortem required, no new features until reliability improves
This framework (from Google's SRE book) gives engineering teams a rational, data-driven way to make reliability decisions.
Follow-up: What is an error budget and how do teams use it?
Structured logging means emitting logs as machine-readable key-value pairs (typically JSON) rather than unstructured text strings.
Unstructured log (hard to query):
[2026-04-01 10:23:15] ERROR Failed to process payment for user 42, order 99, reason: card_declined
Structured log (queryable, filterable):
{
"timestamp": "2026-04-01T10:23:15Z",
"level": "error",
"message": "Payment processing failed",
"userId": 42,
"orderId": 99,
"reason": "card_declined",
"service": "payment-service",
"requestId": "req-abc-123",
"durationMs": 145
}
Why structured logging matters:
- Log aggregation platforms (Datadog, Splunk, Loki) can index and query fields
- You can filter level:error AND service:payment-service instantly
- Metrics can be derived from logs (error rate, p99 latency)
- Consistent format makes automated alerting reliable
Correlation IDs in distributed systems:
When a user request flows through multiple microservices (API gateway → order service → payment service → notification service), each service generates its own logs. Without a shared identifier, it is impossible to stitch together the full trace of a single request.
A correlation ID (also called request ID or trace ID) is a unique identifier generated at the entry point and propagated through every downstream call via HTTP headers:
X-Request-ID: req-abc-123
Each service includes this ID in every log line. When debugging an issue, you filter all logs by the correlation ID and instantly see the complete request journey across all services — timestamps, durations, errors — without guessing which log entries belong together.
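A minimal sketch of propagating and logging a correlation ID. The function names and the fallback ID prefix are illustrative — a real service would read the header in middleware and emit logs through a logging library:

```python
import json
import uuid

def get_request_id(headers):
    """Reuse the incoming X-Request-ID, or generate one at the entry point."""
    return headers.get("X-Request-ID") or f"req-{uuid.uuid4()}"

def log_event(level, message, request_id, **fields):
    """Emit one structured log line carrying the correlation ID."""
    return json.dumps({"level": level, "message": message,
                       "requestId": request_id, **fields})

rid = get_request_id({"X-Request-ID": "req-abc-123"})
line = log_event("error", "Payment failed", rid, userId=42)
```

Every downstream call forwards the same ID in its own X-Request-ID header, so each service's log lines share the field you later filter on.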
Follow-up: How do correlation IDs help with debugging across microservices?