Building Kudu: A Secure Containerized Code Execution Platform – Part 1: A Deep Technical Dive

Kudu is an MIT-licensed, open-source, web-based code execution platform with a client-server architecture that provides secure, containerized code execution. It leverages containerization, real-time monitoring, and secure execution environments. In this technical deep dive series, we explore its architecture, implementation details, and security considerations.

Before diving deep into the technicalities of the project, a good question to ask is why we might need to run code in a secure environment. It turns out there are many reasons to do so. The following are just a few of them:

Isolation and Security
Sandbox Environment: A secure environment acts as a sandbox that isolates code execution from the host system. This means that harmful operations performed by the code will not affect the host system.
Security: Running code in a secure environment reduces the risk of security breaches, as the containers can be configured with limited permissions and resources.
Consistency and Reproducibility
Environment Consistency: A secure environment ensures that the code runs in the same environment across different machines. This eliminates the “it works on my machine” problem by providing a consistent runtime environment.
Reproducibility: Engineers can share images to ensure that others can reproduce the exact environment needed to run the code.
Language and Tooling Support
Multi-language Support: Engineers can run code in different programming languages without needing to install and configure the necessary tools and dependencies on their local machines.
Custom Tooling: Engineers can create custom images with specific tools and libraries pre-installed, catering to specialized needs.
Resource Management
Controlled Resource Allocation: A secure environment allows engineers to control the resources (CPU, memory) allocated to each container, ensuring that resource-intensive operations do not affect the host system’s performance.
Scalability: Containers can be easily scaled up or down, allowing for efficient resource utilization in a shared environment.
Ease of Setup and Use
Quick Setup: Engineers can start coding immediately without spending time setting up the environment. This is especially useful for on-boarding new team members or running quick experiments.
Simplified Maintenance: Images can be updated and maintained centrally, reducing the overhead of managing dependencies and environment configurations on multiple developer machines.
Collaboration and Sharing
Code Sharing: Engineers can share code snippets along with their execution environments, making it easier to collaborate and review code.
Consistent Testing: Ensuring that tests run in a consistent environment helps catch environment-specific issues early in the development process.
Learning and Experimentation
Educational Use: Beginners can experiment with different programming languages and tools without worrying about setting up and maintaining the environment.
Experimentation: Engineers can quickly test new technologies, libraries, or frameworks in an isolated environment without affecting their primary development setup.
Cloud and Remote Development
Remote Execution: Code snippets can be run in the cloud, allowing engineers to leverage powerful remote servers and reducing the load on their local machines.
Cross-platform Development: Engineers can work on code that targets different platforms (Linux, Windows, macOS) without needing multiple physical or virtual machines.

Core Architecture

Back-end Infrastructure

The backbone of Kudu is built with Node.js, a free, open-source, cross-platform JavaScript runtime environment, and Express, a web framework for Node.js, following a modular architecture.

For container orchestration, we used Dockerode to manage container lifecycles. Dockerode is a Node.js client for the Docker API that provides the operations needed to create, start, stop, and remove containers. To monitor these container operations, we integrated metrics for Prometheus, the open-source monitoring system and time series database. Prometheus functions like a personal fitness tracker, continuously recording resource usage and performance so that problems surface early.

In addition, we implemented a multi-layered security approach using security middleware. We talk more about this implementation in the second part of the series.

const express = require('express');
const app = express();

// Security middleware
app.use(require('helmet')());
app.use(require('cors')());
app.use(require('express-rate-limit')()); // rate limiting is provided by the express-rate-limit package

// Health check endpoints
app.get('/health', (req, res) => res.status(200).send('OK'));

// Error handling middleware
app.use((err, req, res, next) => {
  console.error(err);
  res.status(500).send('Something broke!');
});

app.listen(3000, () => console.log('Server running on port 3000'));

The code above sets up an Express.js server with security middleware, a health check endpoint, and error handling.

1. Importing and Initializing Express

const express = require('express');
const app = express();

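express: We first import the express module. Calling express() then creates the application instance (app) on which all middleware and routes are registered.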

2. Security Middleware

app.use(require('helmet')());
app.use(require('cors')());
app.use(require('express-rate-limit')()); // rate limiting is provided by the express-rate-limit package

helmet(): This middleware helps secure our application by setting various HTTP response headers. Headers such as X-Content-Type-Options and X-Frame-Options help mitigate attacks like MIME sniffing, clickjacking, and some forms of cross-site scripting (XSS).
cors(): The CORS (Cross-Origin Resource Sharing) middleware is used to enable or restrict resources to be requested from another domain. It’s very useful in our application for APIs that need to allow requests from different origins (e.g., when a frontend application hosted on a different domain tries to access the backend API).
rate-limit(): This middleware (in the Express ecosystem, typically the express-rate-limit package) applies rate limiting to our application. It helps prevent abuse and DoS (Denial of Service) attacks by capping the number of requests a specific IP or user can make within a given time window.
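
To make the idea of a request cap per time window concrete, here is a minimal sketch of a fixed-window limiter written as plain Express-style middleware. It is not the implementation used by Kudu or by any rate-limiting package; the name createRateLimiter and its options (windowMs, max, now) are hypothetical and exist only for illustration.

```javascript
// Hypothetical fixed-window rate limiter, for illustration only.
// It counts requests per client IP and rejects any beyond `max` in each window.
function createRateLimiter({ windowMs = 60 * 1000, max = 100, now = Date.now } = {}) {
  const hits = new Map(); // client key -> { count, windowStart }

  return function rateLimiter(req, res, next) {
    const key = req.ip;
    const t = now();
    let entry = hits.get(key);

    // Open a fresh window for a new client or once the previous window has expired.
    if (!entry || t - entry.windowStart >= windowMs) {
      entry = { count: 0, windowStart: t };
      hits.set(key, entry);
    }

    entry.count += 1;
    if (entry.count > max) {
      // 429 Too Many Requests: the client exceeded its quota for this window.
      res.status(429).send('Too many requests');
      return;
    }
    next();
  };
}
```

Such a middleware would be registered like any other, for example app.use(createRateLimiter({ windowMs: 60 * 1000, max: 100 })). The sketch never evicts old entries and keeps its state in one process, so a maintained package is the better choice for real traffic.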

3. Health Check Endpoint

app.get('/health', (req, res) => res.status(200).send('OK'));

app.get('/health'): This defines a simple GET route that responds to requests at the /health endpoint.
res.status(200).send('OK'): If our server is up and running, it responds with an HTTP status code of 200 (OK) and the message ‘OK’. This is used by our monitoring tool to check if the server is healthy and operational.

4. Error Handling Middleware

app.use((err, req, res, next) => {
  console.error(err);
  res.status(500).send('Something broke!');
});

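Error handling middleware: This middleware is designed to catch and handle any errors that occur in our application. It takes four parameters: err, req, res, and next. Express recognizes error-handling middleware by this four-parameter signature, and it is registered after the routes so that errors raised in them reach it.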
- console.error(err): This logs the error to the console for debugging purposes.
- res.status(500).send('Something broke!'): The server responds with a 500 status code (Internal Server Error) and a generic error message 'Something broke!'. This is especially useful for catching unexpected issues and providing a response to the client.

5. Starting the Server

app.listen(3000, () => console.log('Server running on port 3000'));

app.listen(3000): This starts the Express server on port 3000. Once the server is running, it logs the message 'Server running on port 3000' to the console, indicating that the application is live and listening for incoming requests on that port.

The Container Execution Pipeline

The code execution pipeline in Kudu is designed to function like a well-oiled machine, ensuring that code runs smoothly and securely. Here’s an overview of how it works:

1. Code Validation: Before we can even begin executing any code, we need to validate it to prevent malicious scripts from running.

const validateCode = (code, language) => {
  const blockedPatterns = [
    /process\.env/i,
    /require\s*\(/i,
    /import\s+(?:os|sys|subprocess)/i,
    /open\s*\(/i,
    /eval\s*\(/i,
    /exec\s*\(/i
  ];
  return !blockedPatterns.some(pattern => pattern.test(code));
};

The validateCode function is designed to check whether a given piece of code contains potentially harmful or restricted patterns, which could be used for security exploits or other unsafe operations in our environment.

Parameters:

code: The source code (as a string) that needs to be validated
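language: The programming language of the code. The version shown here does not use this parameter; the same patterns are applied regardless of language.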

Logic:

Blocked patterns: The function defines an array of regular expressions (blockedPatterns) that represent patterns typically associated with dangerous or restricted operations in code. These patterns are:
- /process\.env/i: Matches code that tries to access environment variables, potentially exposing sensitive information.
- /require\s*\(/i: Matches code that uses Node.js’ require() function, which could be used to load malicious modules.
- /import\s+(?:os|sys|subprocess)/i: Matches import statements for potentially dangerous modules like os, sys, or subprocess in Python-like syntax, which can allow system-level access.
- /open\s*\(/i: Matches code that tries to open files, which could potentially be used to read or write to sensitive files.
- /eval\s*\(/i: Matches code that uses the eval() function, which can execute arbitrary code and is often a security risk.
- /exec\s*\(/i: Matches code that uses the exec() function, which can execute system commands, posing a significant security threat.
Validation: The function checks if any of the patterns in blockedPatterns match the input code using the some method. The some method tests whether at least one of the patterns returns true when applied to the code.
Return Value: The function returns true if none of the blocked patterns are found (i.e., !blockedPatterns.some(…)). If any of the patterns match, it returns false, indicating that the code contains restricted patterns.
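
The following sketch shows how validateCode behaves on a few inputs. The function body is copied from above so the example runs on its own; the sample snippets are made up for illustration.

```javascript
// validateCode as shown in the article, repeated so this example is self-contained.
const validateCode = (code, language) => {
  const blockedPatterns = [
    /process\.env/i,
    /require\s*\(/i,
    /import\s+(?:os|sys|subprocess)/i,
    /open\s*\(/i,
    /eval\s*\(/i,
    /exec\s*\(/i
  ];
  return !blockedPatterns.some(pattern => pattern.test(code));
};

// Hypothetical submissions, one per pattern family plus a harmless one.
const samples = [
  { language: 'javascript', code: 'console.log(1 + 1);' },
  { language: 'javascript', code: "const fs = require('fs');" },
  { language: 'python', code: 'import subprocess' },
  { language: 'python', code: "print(open('/etc/passwd').read())" },
  // Matching is purely textual, so a comment that mentions "eval (" is rejected too.
  { language: 'javascript', code: '// never call eval (unsafe)' }
];

const results = samples.map(({ code, language }) => ({
  code,
  allowed: validateCode(code, language)
}));

for (const { code, allowed } of results) {
  console.log(`${allowed ? 'allowed' : 'blocked'}: ${code}`);
}
```

Only the first sample is allowed. The last one shows that the check works on raw text instead of parsed code, so it can reject harmless submissions as well.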

2. Container Configuration: Each container needs to be configured with strict resource limits to prevent over-consumption of precious resources.

const containerConfig = {
  HostConfig: {
    AutoRemove: true,
    Memory: 100 * 1024 * 1024, // 100MB
    NanoCPUs: 1e9, // 1 CPU
    NetworkMode: 'none',
    OomKillDisable: false,
    PidsLimit: 100,
    SecurityOpt: ['no-new-privileges'],
    ReadonlyRootfs: true
  }
};

The containerConfig object above is a configuration object used to define specific settings for a container when it is being created or started.

HostConfig:
This section defines various configurations related to the host system’s behavior when running the container.

AutoRemove: true
This setting automatically removes the container when it stops. It’s useful for ensuring that containers don’t accumulate unnecessarily after they finish running, preventing clutter and saving disk space.
Memory: 100 * 1024 * 1024 (100MB):
This defines the memory limit for the container in bytes. In this case, it’s set to 100 * 1024 * 1024 bytes (104,857,600 bytes, i.e. 100 MiB), ensuring the container’s processes cannot exceed this threshold.
NanoCPUs: 1e9 (1 CPU):
This defines the CPU allocation for the container in units of one billionth (10^-9) of a CPU. The value 1e9 is therefore equivalent to one full CPU, and 5e8 would correspond to half a CPU. This setting limits how much CPU time the container can consume.
NetworkMode: 'none':
This setting disables networking for the container. With only a loopback interface available, the executed code cannot reach the internet, the host’s network, or other containers.
OomKillDisable: false:
This setting determines whether the Out Of Memory (OOM) killer is disabled for the container. Setting it to false leaves the OOM killer enabled, so if the container exceeds its memory limit, its processes may be terminated.
PidsLimit: 100:
This specifies the maximum number of processes (PID) that the container can create. In this case, the limit is set to 100. Limiting the number of processes can help control resource usage and prevent container misbehavior or resource exhaustion.
SecurityOpt: ['no-new-privileges']:
This security option prevents processes inside the container from gaining new privileges. With no-new-privileges set, mechanisms such as setuid or setgid binaries cannot grant a process more privileges than it started with, which reduces the risk of privilege escalation (e.g., through a vulnerable process).
ReadonlyRootfs: true:
This option makes the container’s root filesystem read-only. By setting it to true, the container’s file system is locked down, which means no writes are allowed to the root file system. This is a security measure that helps prevent unauthorized changes to the container’s system files or malicious modifications.
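
The unit conversions behind Memory and NanoCPUs are easy to get wrong, so here is a small sketch that derives them from friendlier inputs. The helper buildHostConfig and its parameters (memoryMiB, cpus, pidsLimit) are hypothetical and not part of Kudu or Dockerode; the remaining fields repeat the values shown above.

```javascript
// Hypothetical helper: builds a HostConfig like the one above from human-friendly units.
function buildHostConfig({ memoryMiB = 100, cpus = 1, pidsLimit = 100 } = {}) {
  return {
    AutoRemove: true,
    Memory: memoryMiB * 1024 * 1024,   // Docker expects bytes
    NanoCPUs: Math.round(cpus * 1e9),  // Docker expects units of 1e-9 CPUs
    NetworkMode: 'none',
    OomKillDisable: false,
    PidsLimit: pidsLimit,
    SecurityOpt: ['no-new-privileges'],
    ReadonlyRootfs: true
  };
}

const defaultConfig = buildHostConfig();
console.log(defaultConfig.Memory);   // 104857600
console.log(defaultConfig.NanoCPUs); // 1000000000

// A tighter sandbox: 64 MiB of memory and half a CPU.
const smallConfig = buildHostConfig({ memoryMiB: 64, cpus: 0.5 });
console.log(smallConfig.Memory);     // 67108864
console.log(smallConfig.NanoCPUs);   // 500000000
```

With Dockerode, an object like this would typically be passed as the HostConfig field of the options given to docker.createContainer(), alongside fields such as Image and Cmd.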

3. Execution Monitoring: Monitoring is crucial for keeping track of container performance. Let’s explore how we set this up with Prometheus:

const promClient = require('prom-client');

const metrics = {
  executionDuration: new promClient.Histogram({
    name: "code_execution_duration_seconds",
    help: "Duration of code execution in seconds",
    labelNames: ["language"]
  }),
  memoryUsage: new promClient.Gauge({
    name: "container_memory_usage_bytes",
    help: "Memory usage of containers",
    labelNames: ["container_id"]
  }),
  cpuUsage: new promClient.Gauge({
    name: "container_cpu_usage_percent",
    help: "CPU usage percentage",
    labelNames: ["container_id"]
  })
};

The metrics object above defines three metrics (one histogram and two gauges) using the promClient library. These metrics track different aspects of container performance and code execution, and they are structured to be used with Prometheus.

executionDuration (Histogram)
– Type: promClient.Histogram

– Purpose: This metric tracks the duration of code execution, specifically in seconds.

– Configuration:
– name: "code_execution_duration_seconds": The name of the metric is code_execution_duration_seconds, which is descriptive of what it measures.
– help: “Duration of code execution in seconds”: This string provides a brief explanation of the metric’s purpose.
– labelNames: ["language"]: This label adds a language tag to the metric, allowing the metric to be categorized based on the programming language used to execute the code.

The Histogram type is used to track the distribution of a set of values (in this case, the code execution duration). It allows us to record the frequency of different execution durations, which helps analyze performance and identify bottlenecks.
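
To illustrate what tracking a distribution means in practice, the sketch below counts a handful of durations into cumulative buckets, the way Prometheus histograms do (each bucket counts all observations less than or equal to its upper bound). The bucket boundaries and durations are made-up values; the article does not say which buckets Kudu uses, and cumulativeBuckets is a hypothetical helper, not part of prom-client.

```javascript
// Hypothetical upper bounds in seconds, for illustration only.
const upperBounds = [0.1, 0.5, 1, 5];

// For each upper bound, count the observations that are <= that bound.
// A final +Inf bucket always holds every observation.
function cumulativeBuckets(observations, bounds) {
  const buckets = bounds.map(le => ({
    le,
    count: observations.filter(value => value <= le).length
  }));
  buckets.push({ le: Infinity, count: observations.length });
  return buckets;
}

// Made-up execution durations in seconds.
const durations = [0.05, 0.3, 0.3, 0.8, 2.4, 7.0];
const buckets = cumulativeBuckets(durations, upperBounds);

for (const { le, count } of buckets) {
  console.log(`le=${le}: ${count}`);
}
```

From counts like these, Prometheus can estimate quantiles, for example how long the slowest 5% of executions take for each language label.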
memoryUsage (Gauge)
– Type: promClient.Gauge

– Purpose: This metric tracks the memory usage of containers in bytes.

– Configuration:
– name: "container_memory_usage_bytes": The name of the metric is container_memory_usage_bytes, indicating that it monitors memory usage.
– help: "Memory usage of containers": This help string clarifies that the metric represents the memory usage of the containers.
– labelNames: ["container_id"]: This label associates the metric with a specific container by its container_id, allowing us to track memory usage on a per-container basis.

The Gauge type is used for metrics that can go up or down (like memory usage). This metric allows us to track the current memory usage of each container, and the values can change over time as containers use more or less memory.
cpuUsage (Gauge)
– Type: promClient.Gauge

– Purpose: This metric tracks the CPU usage of containers as a percentage.

– Configuration:
– name: "container_cpu_usage_percent": The name of the metric is container_cpu_usage_percent, indicating that it tracks CPU usage.
– help: "CPU usage percentage": This help string clarifies that the metric represents the CPU usage percentage of the container.
– labelNames: ["container_id"]: This label associates the metric with a specific container by its container_id, similar to the memory usage metric we talked about above.

So, like the memoryUsage metric, the Gauge type is used here to track a value that can increase or decrease. This metric measures the CPU usage of each container in terms of percentage, which can be useful for monitoring container performance and identifying resource hogs.
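
The article does not show how the CPU percentage is derived, so the following is only one common approach, sketched under the assumption that the value comes from a Docker stats sample. Such a sample carries the current reading (cpu_stats) and the previous one (precpu_stats). The helper name cpuPercent and the sample numbers are hypothetical.

```javascript
// Computes CPU usage in percent from a single Docker stats sample.
// Usage is the container's share of total system CPU time between the two
// readings, scaled by the number of online CPUs.
function cpuPercent(stats) {
  const cpuDelta =
    stats.cpu_stats.cpu_usage.total_usage - stats.precpu_stats.cpu_usage.total_usage;
  const systemDelta =
    stats.cpu_stats.system_cpu_usage - stats.precpu_stats.system_cpu_usage;
  const onlineCpus = stats.cpu_stats.online_cpus || 1;

  // Guard against the first sample or counter resets, where deltas are not meaningful.
  if (systemDelta <= 0 || cpuDelta < 0) return 0;
  return (cpuDelta / systemDelta) * onlineCpus * 100;
}

// Made-up sample: the container used 0.5s of CPU while the system accumulated 2s on 2 CPUs.
const sample = {
  cpu_stats: {
    cpu_usage: { total_usage: 700000000 },
    system_cpu_usage: 10000000000,
    online_cpus: 2
  },
  precpu_stats: {
    cpu_usage: { total_usage: 200000000 },
    system_cpu_usage: 8000000000
  }
};

const usage = cpuPercent(sample);
console.log(usage); // 50
```

A value computed this way could then be stored on the gauge with metrics.cpuUsage.set({ container_id: id }, usage), where id is the container’s identifier.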

The Container Execution Pipeline is the beating heart of Kudu, ensuring that every submission is validated, executed in a locked-down container, and monitored in real time. It’s like a finely tuned orchestra where each component plays its part to deliver a seamless and secure code execution experience. But we’re just scratching the surface here! In Part One, we’ve covered the foundational architecture, the backend setup, and the critical first steps of the execution pipeline. Stay tuned for Part Two, where we’ll dive into advanced security mechanisms, real-time monitoring tools, performance optimizations, and the lessons learned that shaped Kudu into a robust and scalable platform.
