IceGate: Native Rust Observability and the Death of Overpriced Logging
1. Introduction: The Observability Tax and the Breaking Point
In the modern distributed systems landscape, we have reached a paradoxical tipping point. We build microservices to scale efficiently, we adopt Kubernetes to manage complexity, and we use cloud-native tools to move faster. Yet, as our infrastructure scales linearly, our observability costs scale exponentially.
For the average Staff Engineer at a mid-to-large scale enterprise, the "Datadog Bill" or the "Splunk Renewal" has become more than just a line item; it is a strategic bottleneck. I’ve seen organizations where the cost of logging and tracing exceeds the cost of the compute actually running the business logic. This is what I call the Observability Tax.
We have outsourced our critical operational data to third-party SaaS providers who charge us a premium for "ease of use." They build proprietary indexing engines, manage massive clusters on our behalf, and then charge us by the gigabyte, the host, and the metric—often marking up the underlying S3 storage costs by 10x or 20x.
The industry is waking up. The era of the "unlimited budget" for observability is dead. We are entering the era of the Lakehouse for Logs. This is where IceGate comes in. IceGate isn't just another logging agent; it is a fundamental architectural shift. It is a native Rust engine designed to ingest, process, and commit observability data directly to S3 using Apache Iceberg tables.
By cutting out the middleman and leveraging the power of Rust and open table formats, we can achieve 90% cost savings while maintaining—and often exceeding—the performance of the "Big Box" observability vendors.
2. The Infrastructure Crisis: Why Rust is the Only Logical Choice
When you're building a system that needs to ingest millions of events per second across thousands of nodes, the choice of programming language is no longer a matter of preference; it's a matter of physics and economics.
The Problem with the Garbage Collector (GC)
Most legacy observability tools and even many modern ones (like those written in Go or Java) are handcuffed by a Garbage Collector. In a high-throughput ingest pipeline, memory allocation is constant. You're constantly creating strings, buffers, and objects for every log line.
In Go, the GC eventually has to "stop the world" or at least steal CPU cycles to clean up these objects. At low throughput, this is unnoticeable. At 500,000 events per second, the "GC jitter" becomes a nightmare. You see spikes in tail latency (P99), which forces you to over-provision your ingest nodes just to handle the jitter.
Rust, with its Zero-Cost Abstractions and Borrow Checker, allows us to manage memory deterministically. We don't have a GC. We allocate exactly what we need, often using arenas or reusing buffers, and then we drop it the moment it's no longer needed. This leads to a "flat" latency profile. A Rust ingest node can run at 95% CPU utilization without the fear of a sudden GC-induced death spiral.
SIMD and Data Parallelism
Modern observability is moving away from unstructured text towards structured JSON and Protobuf (OTLP). Parsing millions of JSON lines is CPU-intensive. Rust provides first-class support for SIMD (Single Instruction, Multiple Data) instructions. Using crates like simd-json, IceGate can parse logs at multi-gigabyte-per-second speeds on a single core.
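A minimal sketch of what that looks like with `simd-json` (the crate parses in place, so it needs a mutable copy of the input):

```rust
// Parse one JSON log line with SIMD acceleration.
// simd-json mutates its input buffer, so we copy the line into scratch space.
fn parse_log_line(line: &[u8]) -> Result<simd_json::OwnedValue, simd_json::Error> {
    let mut scratch = line.to_vec();
    simd_json::to_owned_value(&mut scratch)
}
```

In a real hot path you would reuse the scratch buffer across lines (or use the borrowed-value API) to avoid the per-line allocation; the sketch keeps it simple.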
Predictable Resource Density
In a cloud environment, you pay for what you use. If a Go-based ingestor needs 4GB of RAM to handle 100MB/s of logs because of heap overhead, and a Rust-based ingestor needs 256MB for the same throughput, the Rust version is not just "faster"—it's an order of magnitude cheaper to run at scale.
When we talk about Staff-level engineering, we're talking about Resource Density. How much work can we cram into a $15/month EC2 instance? With Rust, the answer is "significantly more."
The Mechanical Sympathy of Rust
To understand why Rust is the king of high-throughput infrastructure, we have to look at Mechanical Sympathy—the idea that the software should be designed with the hardware's constraints in mind.
In a logging ingestor, the bottleneck is usually one of two things: the Network Stack or the Memory Bus.
- Async I/O and io_uring: Traditional synchronous I/O blocks a thread every time you wait for a packet. Go's `netpoller` was a revolution in its time, but it still introduces context-switching overhead. Rust, via the `tokio` and `io-uring` crates, allows us to perform "Proactive I/O." We submit a request to the kernel and get notified only when the data is ready in a pre-allocated buffer, minimizing syscall overhead and avoiding extra copies between kernel space and user space.
- Cache Locality and Arena Allocation: In a language like Java, every `LogEvent` object is scattered across the heap. When the processor needs to process a batch of logs, it has to fetch these objects from RAM, leading to "Cache Misses." In Rust, we use Memory Arenas (via crates like `bumpalo`). We allocate a single, contiguous block of memory for a batch of 1,000 logs, and the CPU can then stream this data into its L1/L2 caches with near-perfect predictability (a sketch follows this list).
- Zero-Copy Serialization: When IceGate receives an OTLP Protobuf message, it doesn't "parse" it into a new set of data structures. Using the `rkyv` or `prost` crates with specialized configurations, we can often view the raw bytes as if they were a structured object. We are essentially zero-copy from the network card to the Arrow buffer.
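Here is a minimal sketch of the arena pattern with `bumpalo` (the `LogEvent` shape is illustrative, not IceGate's actual type): the batch's strings land contiguously in one region and are freed all at once when the arena drops.

```rust
use bumpalo::Bump;

struct LogEvent<'a> {
    service: &'a str,
    message: &'a str,
}

fn process_batch(lines: &[(String, String)]) {
    // One contiguous region for the whole batch: each allocation is a pointer
    // bump, and the entire batch is freed at once when `arena` goes out of scope.
    let arena = Bump::new();
    let events: Vec<LogEvent<'_>> = lines
        .iter()
        .map(|(service, message)| LogEvent {
            service: arena.alloc_str(service),
            message: arena.alloc_str(message),
        })
        .collect();
    // ... encode `events` into an Arrow batch here ...
    assert_eq!(events.len(), lines.len());
}
```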
By the time a Go ingestor has finished its first GC cycle, the Rust ingestor has already committed three batches to S3 and is idling, waiting for more data. This isn't just a performance win; it's an operational stability win. In high-load scenarios, the Go ingestor’s latency becomes non-linear (the "hockey stick" curve), while the Rust ingestor’s latency remains a flat line until the NIC is saturated.
3. IceGate Architecture: Designing for Petabytes
IceGate is built on three core pillars: Stateless Ingest, Local Compaction, and Iceberg Commits.
The Ingest Layer
IceGate exposes a high-performance gRPC and HTTP/2 endpoint that is fully OTLP (OpenTelemetry Protocol) compliant. It uses the tokio runtime to handle tens of thousands of concurrent connections with minimal overhead.
Unlike traditional collectors that try to do everything (transform, filter, route), IceGate is focused on one thing: getting the data from the wire into a memory-mapped buffer as quickly as possible.
The Buffer and Compaction Engine
The "magic" of IceGate happens in its buffering strategy. Instead of writing every log line as a small file to S3 (which would bankrupt you in S3 API call costs), IceGate accumulates data in-memory using Apache Arrow record batches.
Arrow is a columnar memory format. By keeping the data in Arrow format while in-flight, IceGate can perform lightning-fast transformations (like adding metadata tags or filtering sensitive data) without the overhead of serialization/deserialization.
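For illustration, here is a minimal sketch of assembling log events into an Arrow `RecordBatch`; the two-column schema is illustrative, not IceGate's actual one.

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, StringArray, TimestampMillisecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Illustrative two-column log schema: a timestamp and a message.
fn logs_to_batch(ts: Vec<i64>, msgs: Vec<String>) -> Result<RecordBatch, ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("timestamp", DataType::Timestamp(TimeUnit::Millisecond, None), false),
        Field::new("message", DataType::Utf8, false),
    ]));
    let columns: Vec<ArrayRef> = vec![
        Arc::new(TimestampMillisecondArray::from(ts)),
        Arc::new(StringArray::from(msgs)),
    ];
    // Columnar from the start: filters and tag rewrites operate on whole
    // columns, never on per-event heap objects.
    RecordBatch::try_new(schema, columns)
}
```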
Once a buffer reaches a certain size (e.g., 128MB) or a time threshold (e.g., 30 seconds), IceGate triggers a background task to:
- Convert the Arrow batch into a compressed Parquet file.
- Upload the Parquet file to S3.
- Update the Iceberg Metadata.
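A minimal sketch of that size-or-time trigger, assuming a hypothetical `flush` helper that performs the three steps above; IceGate's actual loop is more involved:

```rust
use std::time::Duration;
use tokio::sync::mpsc::Receiver;

const MAX_BYTES: usize = 128 * 1024 * 1024; // 128 MB size threshold
const MAX_AGE: Duration = Duration::from_secs(30); // 30 s time threshold

async fn flush_loop(mut rx: Receiver<Vec<u8>>) {
    let mut ticker = tokio::time::interval(MAX_AGE);
    let mut buffer: Vec<Vec<u8>> = Vec::new();
    let mut bytes = 0usize;
    loop {
        tokio::select! {
            maybe_event = rx.recv() => match maybe_event {
                Some(event) => {
                    bytes += event.len();
                    buffer.push(event);
                    if bytes >= MAX_BYTES {
                        flush(std::mem::take(&mut buffer)).await;
                        bytes = 0;
                    }
                }
                None => break, // all producers gone; drain and exit
            },
            _ = ticker.tick() => if !buffer.is_empty() {
                flush(std::mem::take(&mut buffer)).await;
                bytes = 0;
            },
        }
    }
    flush(buffer).await; // final drain on shutdown
}

// Stand-in for the real pipeline: Arrow -> Parquet -> S3 -> Iceberg commit.
async fn flush(_batch: Vec<Vec<u8>>) {}
```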
The Metadata Layer: Apache Iceberg
This is where IceGate differs from simple "Log-to-S3" scripts. By using Apache Iceberg, we treat our logs as a first-class table.
Iceberg provides:
- ACID Transactions: Multiple ingestors can write to the same table without data loss or corruption.
- Hidden Partitioning: We can partition by `day`, `hour`, or even `service_id` without the user having to manage directory structures manually.
- Schema Evolution: As your logs change (new fields added, types changed), Iceberg handles the schema updates gracefully.
- Time Travel: Want to see what the state of your logs was 2 hours ago? Iceberg's snapshotting makes this trivial.
4. Deep Dive: Apache Iceberg on S3 - The Storage Revolution
For years, the industry was told that logs needed to be indexed in Elasticsearch or OpenSearch to be searchable. This created a massive storage and compute burden. You had to run "hot" nodes with expensive NVMe drives just to keep the indexes performant.
The End of the Indexing Era
IceGate takes a different approach. Instead of building expensive, memory-heavy inverted indexes (like Lucene), we leverage the massive parallel throughput of S3 and the efficiency of Parquet.
Modern query engines like Trino, Athena, and Rust's own DataFusion can scan Parquet files at incredible speeds. Because Parquet is columnar, if you only want to search for error_code, the query engine only reads that specific column from S3.
By using Iceberg's Manifest Files, the query engine knows exactly which files to skip based on the time range or the service name. This is called "Predicate Pushdown." We get 80% of the performance of a fully indexed system at 5% of the cost.
Implementation: The Rust Iceberg Stack
In IceGate, we don't use the Java-based Iceberg libraries. We use a native Rust implementation (leveraging the work being done in the iceberg-rust project). This allows us to maintain our "zero-overhead" promise.
```rust
// A simplified look at how IceGate handles a commit
async fn commit_batch(table: &mut Table, batch: RecordBatch) -> Result<()> {
    let num_rows = batch.num_rows(); // capture before `batch` is consumed below
    let parquet_file = write_parquet(batch).await?;
    let data_file = DataFile::builder()
        .with_path(parquet_file.s3_path)
        .with_format(FileFormat::Parquet)
        .with_record_count(num_rows)
        // ... add column statistics for predicate pushdown
        .build();
    let mut transaction = table.new_transaction();
    transaction.append_data_file(data_file);
    transaction.commit().await?;
    Ok(())
}
```
This architecture allows IceGate to be entirely stateless. If an ingest node dies, another one picks up the work. The "source of truth" is always the Iceberg metadata on S3.
Hidden Internals: The Iceberg Manifest and Snapshot System
To understand how we replace an entire Elasticsearch cluster with a few S3 files, we must look under the hood of Apache Iceberg.
Iceberg is structured in layers:
- The Metadata File: The root of the table. It points to the current "Snapshot."
- The Manifest List: A list of all "Manifest Files" that make up a snapshot.
- The Manifest File: A list of actual "Data Files" (Parquet) and their statistics (min/max values for every column).
When a query comes in—say, SELECT * FROM logs WHERE service = 'auth-api' AND status = 500—IceGate’s query engine doesn't start by reading 10TB of data. It reads the tiny Manifest List. It identifies which Manifest Files contain data for the requested time range. Then, it looks at the statistics in those Manifest Files.
If a Manifest File says "The max value of service in these 100 Parquet files is api-gateway," the engine skips them entirely. This Metadata-level Filtering is what allows S3 to behave like a database.
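To make that concrete, here is a toy sketch (not the iceberg-rust API) of file-level pruning for the `service = 'auth-api'` predicate above: a Parquet file survives only if the value can fall inside that file's recorded min/max range.

```rust
// Per-file column statistics, as recorded in an Iceberg manifest (simplified).
struct FileStats {
    path: String,
    min_service: String,
    max_service: String,
}

// Keep only files whose [min, max] range could contain `service`.
fn prune<'a>(files: &'a [FileStats], service: &str) -> Vec<&'a str> {
    files
        .iter()
        .filter(|f| f.min_service.as_str() <= service && service <= f.max_service.as_str())
        .map(|f| f.path.as_str())
        .collect()
}
```

A file whose stats read max = "api-gateway" can never contain "auth-api", so it is never fetched from S3.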
Handling the "Small File Problem"
One of the biggest pitfalls of S3-based logging is creating millions of tiny files. S3 hates small files; they are slow to list and expensive to read. IceGate solves this with an Asynchronous Compactor.
While the ingest nodes are writing "Append-only" files every 30 seconds, a background worker (also written in Rust) periodically performs a "Rewrite Data Files" operation. It takes twenty 10MB files and merges them into a single, highly-optimized 200MB file, re-sorting them by timestamp for even better compression.
Because Iceberg supports Snapshot Isolation, this compaction happens while the system is live. Queries continue to see the old files until the moment the new compacted file is committed, at which point the switch is atomic. No downtime, no partial reads, no data loss.
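For illustration, a rewrite pass of this shape can be sketched with DataFusion's DataFrame API; exact method signatures vary between DataFusion versions, so treat this as a sketch rather than IceGate's production compactor.

```rust
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;

// Read many small Parquet files, re-sort by timestamp, emit one larger file.
async fn compact(small_files: Vec<String>, out_path: &str) -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    let df = ctx
        .read_parquet(small_files, ParquetReadOptions::default())
        .await?
        .sort(vec![col("timestamp").sort(true, false)])?; // ascending, nulls last
    df.write_parquet(out_path, DataFrameWriteOptions::new(), None).await?;
    Ok(())
}
```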
5. The Economics of IceGate: A Brutal Comparison
Let's talk numbers. This is what gets the CTO's attention.
The SaaS Markup
Suppose you are ingesting 10 Terabytes of logs per day.
Datadog/Splunk Cost: These vendors typically charge between $0.10 and $0.25 per GB for ingestion and short-term retention.
- 10,000 GB × $0.15 = $1,500 per day.
- Monthly: $45,000.
- This doesn't include the costs for "Indexing" or "Long-term Archival."
IceGate (S3 + Compute) Cost:
- S3 Storage: Standard S3 is $0.023 per GB-month. With Parquet compression (often 5x-10x), your 10TB/day shrinks to 1.5TB/day.
- At a 30-day retention steady state (~45TB stored), that is 45,000 GB × $0.023 / 30 ≈ $34.50 per day.
- S3 API Calls: Using Iceberg's optimized commits, call costs are negligible (~$5/day).
- Compute (EKS/Rust): A 10TB/day ingest can be handled by a handful of c7g.xlarge instances.
- Compute cost: ~$20 per day.
- Total IceGate Cost: ~$60 per day.
- Monthly: $1,800.
Total Savings: $43,200 per month. $518,400 per year.
For a 10TB/day workload, you are essentially saving half a million dollars a year by switching to a native Rust/Iceberg stack. That is enough to hire two or three senior engineers.
The Hidden Costs of Legacy
It’s not just the SaaS bill. It’s the "Compliance Tax." When you use a SaaS vendor, you often have to pay extra for "HIPAA compliance" or "PCI compliance" because the data is leaving your VPC.
With IceGate, the data never leaves your environment. It stays in your S3 buckets, behind your IAM roles, encrypted with your KMS keys. You own the infrastructure, you own the data, and you own the cost.
6. Real-World Implementation: Beyond the Whiteboard
Building IceGate isn't just about writing a fast ingestor. It's about building a robust ecosystem.
Multi-tenancy and Isolation
In a large organization, different teams have different logging needs. IceGate handles this through Namespaced Iceberg Catalogs. Each team gets their own table, with their own retention policies and access controls.
Querying the Lake
One of the common pushbacks against "Log Lakes" is that they are hard to query. This is no longer true.
- For Developers: We provide a CLI tool called `ice-grep` (written in Rust, obviously) that uses DataFusion to run SQL queries or regex filters against the S3 files directly.
- For Dashboards: You can point Grafana at a Trino or Athena instance that is reading your Iceberg tables. You get the same "Point and Click" experience as Datadog, but at a fraction of the cost.
Schema Management
IceGate uses an "Infer and Evolve" strategy. When it sees a new field in a JSON log, it automatically updates the Iceberg table schema. If there is a type conflict (e.g., a field was a string and is now an integer), IceGate safely moves the data into a "dead-letter" column rather than dropping the logs or crashing the ingestor.
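As a hedged illustration of that fallback (the names are hypothetical, not IceGate's actual schema engine), the routing decision boils down to: keep the value if it matches the inferred column type, otherwise park it as raw JSON text.

```rust
use serde_json::Value;

/// Returns Ok(value) when the incoming value matches the column's expected
/// JSON kind, or Err(raw JSON text) destined for the dead-letter column.
fn route(value: Value, matches_schema: fn(&Value) -> bool) -> Result<Value, String> {
    if matches_schema(&value) {
        Ok(value)
    } else {
        // Type conflict (e.g. a string column now receiving an object):
        // park the payload as JSON text instead of dropping the log.
        Err(value.to_string())
    }
}

fn main() {
    let v: Value = serde_json::json!({"user": {"id": 42}});
    // The `user` column was inferred as a string; this object goes to dead-letter.
    assert!(route(v, Value::is_string).is_err());
}
```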
Building the Custom Query Engine: The Power of DataFusion
Standard SQL engines like Athena are great, but for a true Staff Engineer, they are often too generic. To provide a "Sub-Second" grep experience over petabytes, IceGate includes a custom query layer built on Apache DataFusion.
DataFusion is an extensible query engine written in Rust. It uses Arrow as its internal memory format (just like IceGate’s ingestor). We have extended DataFusion with custom "Object Store" implementations that are optimized for S3’s parallel nature.
```rust
// An example of a custom DataFusion plan in IceGate, run inside an async
// context. `IcebergTable` is IceGate's custom TableProvider over the
// Iceberg metadata on S3.
let ctx = SessionContext::new();
ctx.register_table("logs", Arc::new(IcebergTable::new(s3_path).await?))?;
let df = ctx
    .sql("SELECT service, COUNT(*) FROM logs WHERE severity = 'ERROR' GROUP BY service")
    .await?;
df.show().await?;
```
By embedding DataFusion directly into our CLI tool, we can perform Distributed Grep. When a developer runs a query, the CLI can spawn "Worker Lambdas" that each scan a portion of the Iceberg table in parallel. You can search 100TB of logs in 5 seconds for the cost of a few cents in Lambda execution time. This is the "Death of Splunk" in action.
Operational Excellence: Running IceGate at Scale
Deploying IceGate is a lesson in modern infrastructure. We don't use long-lived, stateful clusters. We use Spot Instances on EKS.
- Horizontal Scaling: We scale our ingest pods based on the `RequestPerSecond` metric from our Load Balancer. Since IceGate is stateless and starts in under 100ms (thanks to Rust), we can react to traffic spikes instantly.
- Backpressure and Buffering: If S3 is experiencing a rare latency spike, IceGate uses a local WAL (Write-Ahead Log) on NVMe instance storage. It buffers the incoming logs locally and flushes them to S3 as soon as the connection is restored (a minimal WAL sketch follows this list).
- Observability of Observability: We monitor IceGate using... IceGate. It emits OTLP traces of its own internal commit loops, allowing us to tune the buffer sizes and compaction intervals based on real-time performance data.
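A minimal sketch of the WAL append mentioned above, with a hypothetical path and framing: each encoded batch is length-prefixed and fsynced before the sender is acknowledged, so neither a crash nor an S3 outage loses acknowledged data.

```rust
use tokio::fs::OpenOptions;
use tokio::io::AsyncWriteExt;

async fn wal_append(batch: &[u8]) -> std::io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open("/mnt/nvme/icegate.wal") // hypothetical WAL path on instance storage
        .await?;
    file.write_all(&(batch.len() as u64).to_le_bytes()).await?; // length prefix for replay
    file.write_all(batch).await?;
    file.sync_data().await?; // durable before we ACK the sender
    Ok(())
}
```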
The Migration Path: Moving from the "Tax" to the "Lake"
You don't have to switch overnight. The beauty of IceGate being OTLP-native is that you can Dual-Write.
- Phase 1: Keep your existing Datadog/Splunk agent. Add an IceGate "Sidecar" or "Collector" that receives the same stream and writes it to S3.
- Phase 2: Verify the data. Compare the results of a Datadog search with an IceGate/DataFusion search.
- Phase 3: Reduce your SaaS retention from 30 days to 1 day. Use the SaaS for "Real-time Alerts" and IceGate for "Historical Investigation."
- Phase 4: Move your alerting logic to IceGate (using a simple Rust-based rule engine) and turn off the SaaS entirely.
This staged approach de-risks the migration and allows you to prove the ROI to your finance team at every step.
Security, Compliance, and the Governance of the Lake
In the legacy SaaS world, "Governance" often means clicking a checkbox in a UI and hoping the vendor's SOC2 report is accurate. In the IceGate world, Governance is code.
- Fine-Grained Access Control (FGAC): Because our data is in Iceberg/Parquet, we can use AWS Lake Formation or a custom Open Policy Agent (OPA) sidecar to enforce row-level and column-level security. For example, a junior developer might be able to see the `message` and `service` columns, but only a senior SRE can see the `user_ip` or `pii_fields`.
- PII Redaction at the Edge: IceGate's Rust ingestor includes a "Redaction Engine." Using high-performance regex (via the `regex` crate, which is essentially the gold standard for speed), we can scrub PII (emails, credit card numbers) before the data ever touches S3. This significantly reduces our compliance surface area (see the sketch after this list).
- Immutable Auditing: Iceberg's snapshot system is inherently an audit log. We can configure the S3 bucket with Object Lock in "Compliance Mode," making our logs legally immutable for 7 years. This satisfies even the most stringent regulatory requirements (FINRA, GDPR, etc.) without having to pay a "Compliance Premium" to a vendor.
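As an illustration of the edge redaction, here is a minimal sketch with the `regex` crate; the email pattern and replacement tag are toys, not IceGate's production rule set.

```rust
use regex::Regex;
use std::sync::OnceLock;

fn email_re() -> &'static Regex {
    // Compile once, reuse across millions of log lines.
    static RE: OnceLock<Regex> = OnceLock::new();
    RE.get_or_init(|| Regex::new(r"[\w.+-]+@[\w-]+\.[\w.-]+").unwrap())
}

fn redact(line: &str) -> String {
    email_re().replace_all(line, "[REDACTED_EMAIL]").into_owned()
}
```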
Common Pitfalls: Lessons from the Trenches
Building and running a system like IceGate isn't without its challenges. Over the last year of development and deployment, we’ve learned a few hard lessons:
- The Metadata Bloom: If you commit too frequently (e.g., every 5 seconds), your Iceberg metadata will grow too large, slowing down queries. The sweet spot for most workloads is 30-60 seconds or 128MB of data.
- Clock Skew: In a distributed ingest system, clocks are never perfectly synced. IceGate uses a "V-Time" (Virtual Time) strategy to ensure that logs are ordered correctly in the Iceberg table even if an ingest pod's clock is off by a few hundred milliseconds.
- Schema Conflicts: Developers will change a field from a `string` to an `object`. We learned to use "JSON-in-a-String" as a fallback for high-churn fields to avoid constant schema evolution overhead.
The Future: AI, Vectors, and Semantic Logging
As we look toward 2027, the role of logging is changing. We aren't just searching for strings; we are looking for patterns.
The next iteration of IceGate will include Native Vector Embeddings. As logs are ingested, we can use a lightweight Rust-based ML model to generate a vector embedding for every log line and store it in a companion column.
Imagine saying to your query engine: "Find me all logs that are semantically similar to this 'NullPointerException' in the checkout service."
By having the data in an open format like Iceberg, we aren't locked into whatever "AI features" a SaaS vendor decides to ship. We can bring our own models, our own compute, and our own innovations.
7. Conclusion: The Future is Native
The era of "Overpriced Logging" is coming to an end. We are moving away from proprietary, black-box observability and towards open, high-performance infrastructure.
IceGate represents the pinnacle of this movement. By combining the safety and speed of Rust with the scalability and openness of Apache Iceberg, we are giving power back to the engineers.
We no longer have to ask, "Can we afford to log this?" Instead, we can log everything, keep it forever, and query it instantly.
If you are a Staff or Principal Engineer looking at a multi-million dollar observability bill, it is time to stop paying the tax. It is time to look at the architecture. It is time for IceGate.
8. Final Thoughts: Reclaiming the Engineering Soul
For too long, we have treated observability as a service we buy rather than a system we build. We have accepted that "Logging is expensive" as a law of nature.
It is not.
Logging is only expensive because we have been using the wrong tools and the wrong business models. By moving to Native Rust and the Apache Iceberg Lakehouse, we are reclaiming our engineering budgets and our technical autonomy.
The "Death of Overpriced Logging" is not just about saving money. It's about building better, faster, and more secure systems. It's about being an engineer again, instead of just a consumer of SaaS.
Antony Giomarx
Staff Infrastructure Engineer
April 2026