System Design

The system has two sub-services which can be deployed and scaled separately:

  1. a scraper (similar in spirit to Prometheus’ blackbox_exporter) fetching information on remote hosts periodically and publishing the resulting payload to a Kafka topic.

  2. a consumer subscribing to the topic, polling the latest messages and persisting them in Postgres.

For convenience, a console entrypoint is provided running both services in a single process.

Services

The Python async runtime is used heavily. Services run indefinitely until a keyboard interrupt (INT) or a TERM signal is caught; a KILL signal cannot be caught and terminates the process immediately.
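
The shutdown behaviour can be sketched with plain asyncio (a minimal sketch, not the actual entrypoint; `run_forever` stands in for a real service loop):

```python
import asyncio
import signal

async def run_forever(stop: asyncio.Event) -> None:
    # Placeholder for a service loop; a real service would tick here
    # until the stop event is set by a signal handler.
    await stop.wait()

async def main() -> None:
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    # INT (Ctrl-C) and TERM trigger a graceful shutdown;
    # KILL cannot be handled by any process.
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, stop.set)
    await run_forever(stop)

if __name__ == "__main__":
    asyncio.run(main())
```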

  1. The scraper service spawns an infinite asynchronous iterator for each website. The iterator ticks at a fixed frequency. Scraped results are sent to a multiple-producer, single-consumer asyncio.Queue, which is consumed by an AIOKafkaProducer that serializes payloads and publishes them to Kafka.

  2. The persistence service wraps an AIOKafkaConsumer on the one hand and an asyncpg connection pool on the other. Batching almost always improves write performance, and the writer has two parameters controlling it: timeout-ms and max-records; a batch is flushed when either limit is reached first. They have the same meaning as in the getmany() context.
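
The "whichever limit is reached first" semantics can be sketched with a plain asyncio.Queue standing in for the Kafka consumer (`collect_batch` is a hypothetical helper, not the actual writer code):

```python
import asyncio
import time

async def collect_batch(queue: asyncio.Queue, *, timeout_ms: int, max_records: int) -> list:
    """Drain up to max_records items, or however many arrive within timeout_ms.

    Mirrors the getmany(timeout_ms=..., max_records=...) contract: the batch
    is returned as soon as either limit is hit.
    """
    deadline = time.monotonic() + timeout_ms / 1000
    batch: list = []
    while len(batch) < max_records:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return batch
```

In the real writer, the returned batch would then be handed to the asyncpg pool in a single round trip (e.g. with executemany), which is where the batching pays off.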

We need to pay special attention to error, timeout and retry handling. Additionally, we may want to jitter our requests (adding some random noise to the interval) and apply rate limiting (e.g. the scraping interval should never be lower than one second) in order to be nice to remote hosts. Ideally, we would read each site's robots.txt file to know whether we're allowed to scrape it in the first place.
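
Jitter and the minimum-interval rate limit can be folded into the scraper's tick iterator. A minimal sketch (the function name and defaults are illustrative, not the actual implementation):

```python
import asyncio
import random

async def ticks(interval_s: float, *, jitter_s: float = 0.5, min_interval_s: float = 1.0):
    """Yield forever at roughly fixed frequency.

    - interval_s is clamped to min_interval_s, a crude rate limit that keeps
      us from hammering remote hosts.
    - a random jitter in [0, jitter_s] is added to each sleep so that scrapers
      for many sites do not fire in lockstep.
    """
    interval_s = max(interval_s, min_interval_s)
    while True:
        yield
        await asyncio.sleep(interval_s + random.uniform(0.0, jitter_s))
```

For the robots.txt check, the stdlib urllib.robotparser module could be consulted once per site before the loop starts.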

Payloads

Initially, the payloads are transmitted as JSON. The topics themselves are versioned by environment name and message version, e.g. prod.scraped-metrics.v1, to make deploying multiple versions and environments on the same resources easier.
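
The naming scheme boils down to a one-line helper (the function name is hypothetical; only the format comes from the source):

```python
def topic_name(environment: str, message_version: int) -> str:
    """Build a topic name like "prod.scraped-metrics.v1" so that multiple
    environments and message versions can share the same Kafka cluster."""
    return f"{environment}.scraped-metrics.v{message_version}"
```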

In the next release the Avro schema in wms/metrics.avsc will be used to ensure forward and backward compatibility.

Observability

prometheus-async is used to expose service-level metrics such as latency histograms, error counters and queue size gauges. Sentry can be configured for error tracking by providing the project’s DSN in the config.

In the future, distributed traces could be constructed and shipped to a tracing backend as well, and structlog could be used to make logs easier to parse for downstream consumers.

Development

We strive to do TDD and run linters and tests on each commit using GitHub CI. In the interest of time, the project was bootstrapped by committing directly to main. After version 0.2.0, feature branches are used to ensure that main reflects the code running in production.

Deployment

Development and test environments use a docker-compose stack. When the Postgres container is run for the first time, an initialization script creates dev and test databases and users.
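
The Postgres part of the stack could look roughly like the following compose fragment (service names and script paths are illustrative; the official postgres image runs scripts from /docker-entrypoint-initdb.d only on first start, when the data directory is empty):

```yaml
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: postgres
    volumes:
      # Runs once, on first start: creates the dev and test databases/users.
      - ./scripts/init-db.sql:/docker-entrypoint-initdb.d/init-db.sql
```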

The staging and production environments use Aiven-managed cloud services, provisioned with Terraform (see the deployment section). This includes the Postgres and Kafka instances, the related database and topic, but also users and ACLs.

Ideally, CI would set up a staging environment and execute end-to-end tests for each PR. When the PR is merged on main, changes to infrastructure would automatically be applied to the production infrastructure.

For staging and production environments, TLS certs are a hard requirement. They are stored in a configurable location (the certs sub-directory by default). After service provisioning has completed, the simplest ways to fetch secrets are the following:

  1. Export them from the Terraform state into environment variables accessible in the context of the running Python command. This can be trivially done by CI. It is currently done in Bash in scripts/export-tf-outputs.sh.

  2. Use an Aiven auth token to instantiate an aiven-client (in-process) that searches the resources and fetches the appropriate information pre-flight. This was not done in the interest of time (Terraform exposes the secrets essentially for free).
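
With the first option, the running Python process only needs to read its environment at startup. A minimal sketch (the variable names are illustrative; the real ones are whatever scripts/export-tf-outputs.sh exports from the Terraform state):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class KafkaSettings:
    bootstrap_servers: str
    cert_dir: str

def kafka_settings_from_env() -> KafkaSettings:
    # KAFKA_SERVICE_URI and CERTS_DIR are assumed names, not the actual ones.
    return KafkaSettings(
        bootstrap_servers=os.environ["KAFKA_SERVICE_URI"],
        cert_dir=os.environ.get("CERTS_DIR", "certs"),
    )
```

Failing fast with a KeyError when a required variable is missing is deliberate: a service without its Kafka URI cannot do anything useful anyway.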

Terraforming multiple envs

The simple solution is to create one project per environment. To provision multiple environments in the same project, resources’ names need to be parameterized by environment and the relevant tags added.
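
The parameterization could look roughly like this (an illustrative fragment following the Aiven Terraform provider's conventions; resource names, plans and variables are made up):

```hcl
variable "environment" {
  type = string
}

resource "aiven_pg" "db" {
  project      = var.project
  # Name parameterized by environment so prod/staging coexist in one project.
  service_name = "wms-pg-${var.environment}"
  plan         = var.environment == "prod" ? "business-4" : "startup-4"

  tag {
    key   = "environment"
    value = var.environment
  }
}
```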

Migrations

In all environments, the schema in the running Postgres instances must be up to date with the schema used by the library. Otherwise, the persistence service would just crash on startup. This is ensured by saving the current migration version to a migrations table in Postgres and executing new migrations on system startup automatically if needed.

Migrations are distributed as SQL DDL with the package in the migrations directory. Additionally, for each migration, a rollback plan is provided and tested. Updates are named <version>_<name>_update.sql and rollbacks <version>_<name>_rollback.sql.
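
Discovering which update scripts still need to run could be sketched as follows (a hypothetical helper; the exact version format in the filenames, e.g. zero padding, is an assumption):

```python
import re
from pathlib import Path

# Matches update scripts named <version>_<name>_update.sql.
MIGRATION_RE = re.compile(r"^(?P<version>\d+)_(?P<name>.+)_update\.sql$")

def pending_migrations(migrations_dir: Path, current_version: int) -> list[Path]:
    """Return update scripts newer than the version recorded in the
    migrations table, in the order they must be applied."""
    pending = []
    for path in migrations_dir.glob("*_update.sql"):
        match = MIGRATION_RE.match(path.name)
        if match and int(match["version"]) > current_version:
            pending.append((int(match["version"]), path))
    return [path for _, path in sorted(pending)]
```

On startup, the persistence service would apply each returned script in a transaction and bump the version stored in the migrations table.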

Remark: The current automatic migration process does not play well with rolling upgrades or a live production system unless every schema change is backwards-compatible forever. A new version containing a breaking change would take production down, since previously deployed (still running) services start erroring out. This in turn means migrations would have to be tested against every previous version potentially still running in production, which is too much for a PoC.