The details of this article have been communicated to the bank, but after 6 months of silence, I’m assuming it is not an issue for them, and have decided to release this (see timeline below).

I think they are breaking PSD2 regulation around strong authentication, but
I’m not an expert on that subject.

Introduction

When I was a child, I had a 10€ allowance per month. I remember keeping a small paper with all those transactions, but also planning future investments. For example, I knew I had to save for 8 years to afford my driving license.

Fast forward to 2017: student, new flat, and thus the start of the great accounting spreadsheet™.

After 6 years, it became The Humongous Accounting Spreadsheet™.
Turns out, a single spreadsheet is not the ideal tool to track your every expense across multiple countries. So here we are, with My Own Accounting Tool®.

It is fancy-enough, auto-categorizes most transactions, and can display pretty graphs.

Problem is, I still have to record transactions manually.

  • When I’m lucky, it’s a cURL command gathered from Firefox (DevTools > copy as cURL).
  • For others, a wonky regex-based Python script to parse statements.

My goal: automatically fetch transactions directly from my bank account. This should reduce input mistakes and accounting errors. My bank should have an API, right?

The official API

The bank seems to have a public API: https://developer.lcl.fr/.
But as far as I understand, one needs to sign an agreement with the authorities or something before getting some kind of certificate to sign requests. Not going down that path tonight!

This bank also offers a website, so unless it’s full SSR, they should have some API I can plug into.

The other API

A quick look at the network requests, and here we are: https://monespace.lcl.fr/api/*!

The most interesting routes seem to be:

  • https://monespace.lcl.fr/api/login
  • https://monespace.lcl.fr/api/login/keypad
  • https://monespace.lcl.fr/api/login/contract
  • https://monespace.lcl.fr/api/user/accounts?type=current&contract_id=XXXXXXXX
  • https://monespace.lcl.fr/api/user/<account-id>/transactions

Those should be enough to fetch my own banking information.

Step 1: Login

To access my own data, I need to log in.
For some unknown reason, banks in France LOVE weird SeCuRe visual keypads.
This bank doesn’t deviate: a 6-digit PIN is the only password you need.

This bank's keypad

First surprising element: no 2FA by default? This bank does provide one (a prompt on a trusted device), but it is only required for a few specific operations. I tried logging in from a blank browser, from a phone, with a new IP, and still: only the 6-digit password.

⚠ When traveling abroad, I noticed 2FA was required on the web page once, even when logging in from an already trusted device.
I rented a VPN and tried my script from a few locations in France and Europe: 2FA was never required. I’m not sure which heuristic they chose, but since I can log in from an untrusted location and an untrusted device, it seems weak.

The 2 important network requests during the login are:

  • https://monespace.lcl.fr/api/login
  • https://monespace.lcl.fr/api/login/keypad

When you load the page, a first GET request is sent to api/login/keypad.
Upon login, a POST request is sent to api/login.

⚠ I redacted some parts of the request samples, as I don’t know what those fields are, nor whether they contain secrets I shouldn’t share.

api/login/keypad GET request

{
    "keypad": "13236373539383433303XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
  • keypad: A long, apparently random, digit-only sequence (partially redacted).

api/login POST request

{
   "callingUrl" : "/connexion",
   "clientTimestamp" : 1692997262,
   "encryptedIdentifier" : false,
   "identifier" : "XXXXXXXXXXX",
   "keypad" : "030303939303XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
   "sessionId" : "00000000000000000000001"
}
  • clientTimestamp: timestamp of the request.
  • encryptedIdentifier: always false, not sure why. Maybe something for plain HTTP requests?
  • identifier: the customer number.
  • keypad: A long, digit-only sequence (partially redacted). Maybe a challenge response?
  • sessionId: some client-side value derived from the timestamp. It seems to accept any numerical value, as long as it respects some format.

Digit mangling

A large random number received, some client-side processing with a keypad, and a large random number sent back. Some kind of challenge-response? Not exactly.

The keypad parameter is composed of 2 parts:

  • 13236373539383433303: a sequence determining the order of the keys on the keypad.
  • XXXXXXXXX...: the random seed used to generate that order?

So what does my login request look like with the code 011000?
"keypad": "030303939303XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

The repetition pattern looks familiar.

03 03 03 93 93 03 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 0  0  0  1  1  0 ??

Yes, that’s the PIN code, mangled digit by digit, and reversed. The mangling is a bit weird:

  • take the received keypad
  • reverse the string
  • parse the digits, 2 by 2, as hex values
  • take the last 10 pairs
  • interpret them as base-10 values
  • take the ASCII char corresponding to each value
  • those are your keypad numbers

I’ll spare you the JS handling the keypad, but here is the Python code to log in.

import re
import time

import requests

answer = requests.get("https://monespace.lcl.fr/api/login/keypad").json()

keypad = answer['keypad']

# Weird mangling/obfuscation for the keypad values:
# reverse the string, read the digits 2 by 2 as hex values; the last
# 10 pairs, interpreted as base-10, are the keypad digits.
pairs = [ chr(int(x, base=16)) for x in re.findall('..', keypad[::-1]) ]
keys = pairs[-10:]
seed = "".join(pairs[:-10])

password = input("Your 6 digit PIN? ")
# Replace each PIN digit by its position on the shuffled keypad.
mangled = "".join([ str(keys.index(x)) for x in password ])
# Re-encode (seed + mangled PIN) to hex, and reverse the whole string.
token = "".join([ hex(ord(x))[2:] for x in (seed + mangled) ])[::-1]

payload = {
    'callingUrl': "/connexion",
    'encryptedIdentifier': False,
    'identifier': "<customer-id>",
    'keypad': token,
    'clientTimestamp': int(time.time()),
    'sessionId': "<some-random-value>"
}

requests.post("https://monespace.lcl.fr/api/login", json=payload)

Getting the transactions

Now that we are logged in, we want to list transactions.

  • Each transaction is tied to an account.
  • Each account is tied to a contract.
  • Each contract is tied to a user.

So to get my transactions, I need to get the contract, then the account, and only then the transactions.

The initial login request returns a bit of information:

{
    "accessToken": "<Bearer token>",
    "refreshToken": "<Refresh token>",
    "expiresAt": "<timestamp>",
    "multiFactorAuth": null,
    "userName": "<name>",
    "birthdate": "<birthdate>",
    "[...]"
    "contracts": [
        {
            "id": "<contract-id>",
            "[...]"
        }
    ]
}

As-is, the accessToken cannot be used to fetch transactions. Instead, it is used to get a second token, which authenticates requests made to a specific “contract”. I’m not sure how accounts are tied to “contracts”, but in my case, I have 1 contract tied to 1 account.

api/login/contract POST request

{
    "clientTimestamp": timestamp,
    "contractId": base64.b64encode(contract["id"].encode()).decode()[:-2]
}

Why is the contract ID base64 encoded? Maybe some code sharing with the user/accounts GET route?
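In Python, that token exchange might look like this (a sketch: login_response is the JSON returned by api/login above, and sending the first accessToken as a bearer Authorization header is my assumption):

import base64
import time

import requests

contract = login_response["contracts"][0]
# base64-encode the contract id, and drop the trailing "==" padding
contract_id = base64.b64encode(contract["id"].encode()).decode()[:-2]

response = requests.post(
    "https://monespace.lcl.fr/api/login/contract",
    json={"clientTimestamp": int(time.time()), "contractId": contract_id},
    headers={"Authorization": f"Bearer {access_token}"},  # assumed header
)
contract_token = response.json()["accessToken"]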

api/login/contract response

{
    "accessToken": "<another-token>",
    "refreshToken": "<refresh-token>",
    "expiresAt": "<timestamp>"
}

This access token can be used on 2 routes:

  • https://monespace.lcl.fr/api/user/accounts?type=current&contract_id=XXXXXXXX
  • https://monespace.lcl.fr/api/user/<account-id>/transactions

api/user/accounts GET request

This request takes 4 parameters:

  • type: the type of the contract/account to fetch? Here set to current.
  • contract_id: the base64 encoded contract ID.
  • is_eligible_for_identity: false. Not sure what this is about.
  • include_aggregate_account: <boolean>

It returns some information about the fetched account:

{
    "total": "<balance-in-euro>",
    "accounts": [
        {
            "type": "current",
            "iban": "<the iban>",
            "amount": {
                "date": "2023-08-25T22:45:46.892+0200",
                "value": "<balance>",
                "currenty": "EUR"
            },
            "internal_id": "<internal-account-id>",
            "external_id": "<external-account-id>",
            "[...]"
        }
    ]
}

api/user/<account-id>/transactions GET request

This request takes 2 parameters:

  • contract_id: this time, the internal_id received in the previous request.
  • range: <int32>-<int32>. From-To range of transactions to fetch. 0 is the most recent transaction.
{
    "isFailover": "<boolean>",
    "accountTransactions": [
        {
            "label": "CB some shop",
            "booking_date_time": "1970-01-01T00:00:00.000Z",
            "is_accounted": "<boolean>",
            "are_details_available": "<boolean>",
            "amount": {
                "value": -5.32,
                "currency": "EUR"
            },
            "movement_code_type": "<code>",
            "nature": "<I/CARTE/VIREMENT SEPA RECU/PRELVT SEPA RECU XXX>"
        }
    ]
}
  • movement_code_type: not sure; sometimes absent, sometimes an int (like 948).
  • nature: seems to be a free-form field, as SEPA order text can be seen there.
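Putting it together, fetching a page of transactions could look like this (a sketch: the bearer header, and using the internal_id as the <account-id>, are assumptions based on the descriptions above):

import requests

response = requests.get(
    f"https://monespace.lcl.fr/api/user/{internal_id}/transactions",
    params={"contract_id": internal_id, "range": "0-49"},  # 50 most recent
    headers={"Authorization": f"Bearer {contract_token}"},  # assumed header
)
for tx in response.json()["accountTransactions"]:
    print(tx["booking_date_time"], tx["amount"]["value"], tx["label"])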

Getting old transactions

The api/user/<account-id>/transactions request takes a range. But if this range contains any transaction older than 90 days, the request fails: 2FA is required for such a request.

Digging a bit, I found 2 other API routes:

  • api/user/documents/accounts_statements
  • api/user/documents/documents

Those routes have no limit on the dates.

WAIT, WHAT?

Yes, they do require 2FA to call https://monespace.lcl.fr/api/user/<account-id>/transactions for transactions older than 90 days, but PDF statements since the dawn of time? Sure, NO PROBLEM.

The returned values have this format:

[
    {
        "codsoufamdoc_1": "AST",
        "datprddoccli": "2020-12-02",
        "downloadToken": "<some-token>",
        "liblg_typdoc": "Some human-readable document title",
        "libsoufamdoc_1": "Some human-readable category"
    }
]

To download the PDF, send a GET request with the downloadToken fetched in the previous request:

https://monespace.lcl.fr/api/user/documents/download?downloadToken=<token>
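Scripted, downloading every statement could look like this (a sketch; the bearer header is again an assumption):

import requests

headers = {"Authorization": f"Bearer {contract_token}"}  # assumed header

docs = requests.get(
    "https://monespace.lcl.fr/api/user/documents/accounts_statements",
    headers=headers,
).json()

for doc in docs:
    pdf = requests.get(
        "https://monespace.lcl.fr/api/user/documents/download",
        params={"downloadToken": doc["downloadToken"]},
        headers=headers,
    )
    with open(f"statement-{doc['datprddoccli']}.pdf", "wb") as f:
        f.write(pdf.content)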

Final thoughts

No two-factor authentication

A 6-digit PIN. Really?
Why isn’t 2FA enforced by default? Even my empty Twitter account is better secured.

Why is the PIN code mangled?

Isn’t SSL enough to secure your payload?
This ROT13-like obfuscation really seems weak if that’s the worry.

Auth tokens remain valid for 21 days

The web session does auto-exit after ~30 minutes of inactivity.
But did you know the auth tokens remain valid for 21 days?

Anyway, I do have what I need to interoperate with my accounting application, and I can rest peacefully, knowing my personal information is safe 🙃.

All the information disclosed here is public, and freely accessible with any web browser. I was required to figure this out to build interoperability with my own software.

Disclosure timeline

  • 28-08-2023: found those weaknesses, documented them.
  • 01-09-2023: contacted the bank on Twitter via private message to ask about this.
  • 04-09-2023: contacted the bank by email, since the Twitter message hadn’t been replied to.
  • 05-09-2023: received a Twitter message saying “we received the email, we’ll reply”.
  • 21-02-2024: no news, same behavior observed. Published this article.


This blog is simple: some .md files, generated to static HTML. No backend or complex CMS. It’s light, loads fast, and readable enough on mobile (except code blocks). Versioning is done with git. But it had one drawback: the “high” cost of publishing.

Building is done using Jekyll, then pushing the files to an FTP server. Since I publish rarely, I had no warm setup: sometimes my Ruby installation was broken, sometimes some dependencies were. Once built, pushing to the FTP was a mix of FTP FuseFS + rsync (my OVH hosting had no ssh/sshfs access). As always with manual intervention, errors could happen!

Anyway, I had some free credits on GCP, so I tried Cloud Builder (I had used it in the past to set up CIs), and quickly stopped. The goal was to simplify the whole process, and using GCP was not going in the right direction.

I found out about Firebase and decided to give it a try (spoiler: the blog is hosted on Firebase as of today). It had everything I need: it’s fast, simple, and absolutely cheap for my use-case!

Deploying the website was simple:

$ firebase deploy

This is combined with a GitHub workflow spinning up a Ruby docker image, building the website, and pushing it to Firebase:

'on':
  push:
    branches:
      - master
jobs:
  build_and_deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Prepare tree
        run: 'mkdir build'
      - name: Build docker image
        run: 'cd docker && docker build -t builder . && cd ..'
      - name: Build website
        run: 'docker run -t --rm --mount type=bind,src=/home/runner/work/blog/blog,dst=/mnt/src --mount type=bind,src=/home/runner/work/blog/blog/build,dst=/mnt/output builder'
      - uses: FirebaseExtended/action-hosting-deploy@v0
        with:
          repoToken: '${{ secrets.GITHUB_TOKEN }}'
          firebaseServiceAccount: '${{ secrets.FIREBASE_SERVICE_ACCOUNT }}'
          channelId: live
          projectId: blog-1234

This also brings another advantage: GitHub becomes my CMS. I can write a new article as long as I have some GitHub access. Which is quite convenient!



While working on parallax mapping, somebody told me about a cool presentation: Sparse Virtual Textures. The idea is quite simple: reimplement pagination in your shaders, allowing you to have infinite textures while keeping the GPU memory usage constant.

The goal was set: add SVT support to my renderer!

Step 1 - Hand-made pagination

Pagination overview

To understand how SVT works, it is useful to understand what pagination is.

On most computers, data is stored in RAM. RAM is a linear buffer: its first byte is at address 0, and the last at address N.

For some practical reasons, using the real address is not very convenient. Thus, some clever folks invented segmentation, which then evolved into pagination.

The idea is simple: use a virtual address, which is translated by the CPU into the real (physical) RAM address. The whole mechanism is well explained by Intel¹.

This translation is possible thanks to pagetables.

Translating every address into a new independent one is costly and not needed. That’s why they divided the whole space into pages. A page is a set of N contiguous bytes; for example, on x86, we often talk about 4kB pages.

What the CPU translates are page addresses. Each block is translated as a contiguous unit; the internal offset remains the same. This means for N bytes, we only have to store N/page_size translations.

pagination recap

Here, on the left, you have the virtual memory, divided into 4 blocks (pages). Each block is linearly mapped to an entry in the pagetable.

The mapping can be understood as follows:

  • Take your memory address.
    • address = 9416
  • Split it into a page-aligned value and the rest.
    • 9416 => 8192 + 1224
    • aligned_address = 8192
    • rest = 1224
  • Take the aligned value, and divide it by the page size.
    • 8192 / 4096 = 2
    • index = 2
  • This result is the index in the pagetable.
  • Read the pagetable entry at this index, this is your new aligned address:
    • pagetable[2] = 20480
  • Add the rest back to this address:
    • physical_address = 20480 + 1224
  • You have your physical address.
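The same walkthrough, as a few lines of Python (the pagetable content is hypothetical, except entry 2, which comes from the example above):

PAGE_SIZE = 4096

pagetable = {2: 20480}  # hypothetical pagetable; entry 2 is from the example

def translate(virtual_address):
    index = virtual_address // PAGE_SIZE   # which page: 9416 // 4096 = 2
    rest = virtual_address % PAGE_SIZE     # offset inside the page: 1224
    return pagetable[index] + rest         # 20480 + 1224

assert translate(9416) == 21704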

Adding the page concept to the shader

To implement this technique, I’ll need to:

  • find which pages to load
  • load them in the “main memory”
  • add this pagetable/translation technique.

This could be done using compute shaders and linear buffers, but why not use textures directly? This way I can just add a special rendering pass to compute visibility, and modify my pre-existing forward rendering pass to support pagetables.

The first step is to build the pagetable lookup system. This is done in GLSL:

  • take the UV coordinates
  • split them into a page-aligned address, and the rest
  • compute the page index in both X and Y dimensions
  • look up a texture at the computed index (our pagetable)
  • add the rest back to the value

Showing UV coordinates

Showing page-aligned UV coordinates
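As a CPU-side sketch of that lookup (a Python stand-in for the GLSL; the names are mine):

def svt_sample(u, v, pagetable, pagetable_size, page_uv_size):
    # split the UV coordinates into a page index and the rest
    x, y = u * pagetable_size, v * pagetable_size
    page_x, page_y = int(x), int(y)
    rest_x, rest_y = x - page_x, y - page_y

    # the pagetable entry holds the page's origin in the physical texture
    phys_u, phys_v = pagetable[page_y][page_x]

    # add the rest back, scaled to one page's footprint in the memory texture
    return (phys_u + rest_x * page_uv_size,
            phys_v + rest_y * page_uv_size)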

Computing visibility

The other advantage of pagination is the ability to load/unload parts of the memory at runtime. Instead of loading the whole file, the kernel only loads the required bits (pages), and only fetches new pages when required.

This is done using a pagefault:

  • The user tries to access a not-yet-loaded address.
  • The CPU faults, and sends a signal to the kernel (page fault).
  • The kernel determines if this access is allowed, and loads the page.
  • Once loaded, the kernel can resume the user program.

This mechanism requires hardware support: the CPU knows what a pagetable is, and has this interruption system. In GLSL/OpenGL, we don’t have such a thing. So what do we do when interrupts don’t exist? We poll!

For us, this means running an initial rendering pass, but instead of rendering the final output with lights and materials, we output the page addresses (similar to the illustration image seen above).

This is done by binding a special framebuffer and doing render-to-texture. Once the pass has completed, the output texture can be read, and we can discover which pages are visible.

For this render pass, all materials are replaced with a simple shader:

#version 420 core

/* material definition */
uniform float textureid;
/* Size of a page in pixels. */
uniform float page_size;
/* Size of the pagetable, in pixels (aka how many entries do we have). */
uniform float pagetable_size;
/* Size in pixels of the final texture to load. */
uniform float texture_size;
/* Aspect ratio difference between this pass, and the final pass. */
uniform float svt_to_final_ratio_w; // svt_size / final_size
uniform float svt_to_final_ratio_h; // svt_size / final_size

in vertex_data {
    vec2 uv;
} fs_in;

out vec4 result;

/* Determines which mipmap level the texture should be visible at.
 * uv: uv coordinates to query.
 * texture_size: size in pixels of the texture to display.
 */
float mipmap_level(vec2 uv, float texture_size)
{
    vec2 dx = dFdx(uv * texture_size) * svt_to_final_ratio_w;
    vec2 dy = dFdy(uv * texture_size) * svt_to_final_ratio_h;

    float d = max(dot(dx, dx), dot(dy, dy));
    return 0.5f * log2(d);
}

void main()
{
    /* how many mipmap levels we have for the page-table */
    float max_miplevel = log2(texture_size / page_size);

    /* what mipmap level do we need */
    float mip = floor(mipmap_level(fs_in.uv, texture_size));

    /* clamp on the max we can store using the page-table */
    mip = clamp(mip, 0.f, max_miplevel);

    vec2 requested_pixel = floor(fs_in.uv * texture_size) / exp2(mip);
    vec2 requested_page = floor(requested_pixel / page_size);

    /* Move values back into a range supported by our framebuffer. */
    result.rg = requested_page / 255.f;
    result.b = mip / 255.f;

    /* I use the alpha channel to mark "dirty" pixels.
     * On the CPU side, I first check the alpha value for > 0.5,
     * and if yes, consider this a valid page request.
     * I could also use it to store a "material" ID and support
     * multi-material single-pass SVT. */
    result.a = 1.f;
}

Once the page request list is retrieved, I can load the textures into the “main memory”.

The main memory is a simple 2D texture, and page allocation is for now simple: the first page requested gets the first slot, and so on until memory is full.

“Main memory” texture

Once the page is allocated, I need to update the corresponding pagetable entry to point to the correct physical address. This is done by updating the correct pixel in the pagetable:

  • R & G channels store the physical address.
  • B is unused.
  • A marks the entry as valid (loaded) or not.
Pagetable texture
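A sketch of this allocate-and-update logic (upload_page_pixels and write_pagetable_entry are hypothetical helpers for the texture updates):

free_slots = list(range(SLOT_COUNT))   # one slot per page in the memory texture
resident = {}                          # (page_x, page_y) -> slot

def load_page(request):
    if request in resident or not free_slots:
        return
    slot = free_slots.pop(0)           # first page requested gets the first slot
    resident[request] = slot
    upload_page_pixels(request, slot)                 # blit page into memory texture
    write_pagetable_entry(request, slot, valid=True)  # R/G = address, A = valid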

Rendering pass

The final pass is quite similar to a classic pass, except instead of binding one texture for diffuse, I bind 2 textures: the pagetable, and the memory.

  • bind the 3D model
  • bind the GLSL program
  • bind the pagetable and main-memory textures.

At this stage, I can display a texture too big to fit in RAM & VRAM.

Step 2: MipMapping

If you look at the previous video, you’ll notice two issues:

  • Red lines showing up near the screen edges.
  • Page load increase when zooming out.

The first issue is because texture loading doesn’t block the current pass. This means I might request a page, and not have it ready by the time the final pass is run. I could render it as black, but I wanted to make it visible.

The second issue is because I have a 1:1 mapping between the virtual page size and the texture page size. Zooming out to show the entire plane would require loading the entire texture, a texture which doesn’t fit in my RAM.

The solution to both of these issues is mipmaps.

  • A page at mipmap level 0 covers page_size pixels.
  • A page at mipmap level 1 covers page_size * 2 pixels
  • A page at mipmap level N covers the whole texture.

Now, I can load the mipmap level N by default, and if the requested page is not available, I just go up the mip levels until I find a valid page, as sketched below.
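A sketch of this fallback (is_valid and entry are hypothetical accessors; the top level is assumed always resident):

def resolve_page(pagetable, x, y, mip, max_mip):
    # climb the mip chain until we find a resident page
    while mip < max_mip and not pagetable[mip].is_valid(x, y):
        x //= 2
        y //= 2
        mip += 1
    return pagetable[mip].entry(x, y)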

Adding mipmaps also allows me to implement a better memory eviction mechanism:
I can now replace 4 pages with one page a level above.
So if I’m low on memory, I can just downgrade some areas, and save 75% of my memory.

Finally, mipmapping reduces the bandwidth requirements: if the object is far, why load the texture in high resolution? A low-resolution page is enough:

  • less disk load.
  • less memory usage.
  • less latency (since there are fewer pages to load).
Showing physical addresses with MipMapping

Step 3: Complex materials

The initial renderer had PBR materials. Such materials have not only an albedo map, but also normal and roughness+metallic maps. To add new textures, there are several options:

  • New memory textures, new pagetable texture, new pass.
    • simple.
    • requires an additional pass. This is not OK.

  • Same memory texture, same pagetable texture.
    • Each page in fact contains the N textures sequentially. So when one page is loaded, N textures are queried and loaded.
    • Easy to implement, but I have to load N textures.

  • Same memory texture, multiple pagetable textures.
    • pagetables are small, 16x16 or 32x32. The overhead is not huge.
    • I can unload some channels for distant objects (normal maps, for example).
    • Drawback: I now have N*2 texture samplings in the shader: one for each texture and its associated pagetable.

Because I like the flexibility of this last option, I chose to implement it. In the final version, each object has 4 textures:

  • memory (1 mip level)
  • albedo pagetable (N mip levels)
  • roughness/metallic pagetable (N mip levels)
  • normal pagetable (N mip levels)

In the following demo, page loading is done in the main thread, but limited to 1 page per frame, making the loading process very visible.

  • Bottom-left graph shows the main memory.
  • Other graphs show the pagetables and their corresponding mip-levels.

Page requests: subsampling, randomness, and frame budget

For each frame, I need to do this initial pass to check texture visibility. Reading this framebuffer back on the CPU between each frame is quite slow, and for a 4K output, this is prohibitively expensive.

The good news is: I don’t need a 4K framebuffer in that case! Pages cover N pixels, so we can just reduce the framebuffer size and hope our pages will still be requested!

The demo above uses a 32x32 framebuffer, which is very small. If done naïvely, this wouldn’t work: some pages would be caught between 2 rendered pixels, and never loaded.

8x8 framebuffer, no jitter.

A way to solve that is to add some jitter to the initial pass. The page request viewpoint is not exactly the camera’s position, but the camera’s position + some random noise.

This way, we can increase coverage without increasing the framebuffer size.
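A sketch of the idea (with_ndc_offset is a hypothetical camera helper):

import random

def visibility_pass_camera(camera, fb_size):
    # shift the viewpoint by up to half a framebuffer pixel (in NDC units),
    # so successive frames sample different parts of each page
    dx = random.uniform(-0.5, 0.5) * 2.0 / fb_size
    dy = random.uniform(-0.5, 0.5) * 2.0 / fb_size
    return camera.with_ndc_offset(dx, dy)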

8x8 framebuffer, jitter.
  1. See the Intel 64 and IA-32 Architectures Software Developer’s Manual: Vol. 3A, Chapter 3



I never experimented with machine learning or denoising. I guess having obscure matrices combined together to produce some result scared me a bit… surprising for someone who loves computer graphics… 🙃
After failing an interview for an ML-related position (surprising?), I thought enough is enough: time to play catch-up!

For this project, I started with the basics: Andrew Ng’s ML course. After a couple of days — and obviously becoming the greatest ML expert in the world — I decided to tackle the easiest problem ever: image denoising!

The goal

Denoising is a complex field, and some very bright people are making a career out of it. Not my goal!

Here I’ll try to explore some classic denoising techniques, implement them, and once familiar with some of the problems, build a custom model to improve the result.

The input:

challenge image

I believe this should be a good candidate:

  • has a flat shape to check edge preservation.
  • has some “noise” to keep (foliage).
  • has some small structured details (steel beams).
  • has smooth gradients (sky).

Step 1 - sanity check

pixel line

From Wikipedia:

noise is a general term for unwanted […] modifications that a signal may suffer

The graph above represents a line of pixels belonging to a smooth shade. In red are 2 bad pixels. They are bad because they interrupt the smoothness of our graph, and are thus perceived as noise.

How can we remove such outliers? Averaging! Each pixel’s value is averaged with its neighbors. In this case, this would help reduce perceptible noise.

  foreach x, y in image
    neighbors = extract_window_around(image, x, y, window_size=10)
    res = average(neighbors)
    image.set(x, y, res)
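In runnable form (a NumPy sketch; note it writes to a copy, since averaging in place would smear already-blurred pixels into their neighbors):

import numpy as np

def box_blur(image, window_size=10):
    h, w = image.shape[:2]
    half = window_size // 2
    out = np.empty_like(image)
    for y in range(h):
        for x in range(w):
            ys, ye = max(0, y - half), min(h, y + half + 1)
            xs, xe = max(0, x - half), min(w, x + half + 1)
            out[y, x] = image[ys:ye, xs:xe].mean(axis=(0, 1))
    return out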

smooth, before & after

But in real life, that’s terrible…

real, before & after

The reason for this poor performance is that we don’t discriminate valid details from noise. We lose our edges, and all details are lost.

Step 3 - Better average - YUV vs RGB

The previous image was generated by averaging RGB values using a 10-pixel sliding window. Because it was averaging RGB values, it mixed colors. As a result, edges were blurred in a very perceptible way, leading to an unpleasant result.

YUV is another color representation: instead of red, green, and blue, it splits the signal into color and luminosity. The color is carried by two chrominance components (U and V), and the luminosity by a single linear value (Y).

If we look at the sky, the noise doesn’t seem to alter the color a lot, only the brightness of the blue. So averaging using the same window, but only on the luminance component, should give better results:

yuv, smooth yuv, real
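A sketch of that luma-only averaging, assuming BT.601 conversion matrices (one common convention; not necessarily the one I used):

import numpy as np

# BT.601 RGB -> YUV matrix
RGB2YUV = np.array([[ 0.299,    0.587,    0.114  ],
                    [-0.14713, -0.28886,  0.436  ],
                    [ 0.615,   -0.51499, -0.10001]])

def blur_luma_only(image_rgb, blur):
    yuv = image_rgb @ RGB2YUV.T
    yuv[..., 0] = blur(yuv[..., 0])      # average only the Y (luminance) channel
    return yuv @ np.linalg.inv(RGB2YUV).T

# e.g. blur_luma_only(image, lambda y: box_blur(y, window_size=10))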

Step 4 - selective average

Using YUV instead of RGB helped: the sky looks fine, and the green edges look sharper. Sadly, the rest of the image still looks bad. The reason is that I use the same window size for the sky and the tower.

I can improve that solution using a new input: an edge intensity map. Using the well-known Sobel operator, I can generate a map of the areas to avoid.

  edge_map = sobel(image)
  foreach x, y in image
    window_size = lerp(10, 1, edge_map.at(x, y))
    neighbors = extract_window_around(image, x, y, window_size)
    res = average(neighbors)
    image.set(x, y, res)
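The edge map itself can be sketched with SciPy’s Sobel filters (assuming a grayscale image normalized to [0, 1]):

import numpy as np
from scipy import ndimage

def edge_intensity(gray):
    gx = ndimage.sobel(gray, axis=1)   # horizontal gradient
    gy = ndimage.sobel(gray, axis=0)   # vertical gradient
    mag = np.hypot(gx, gy)
    return mag / mag.max()             # 0 = flat area, 1 = strong edge

def window_size(edge, w_max=10, w_min=1):
    # lerp(10, 1, edge): wide windows in flat areas, narrow ones on edges
    return int(round(w_max + (w_min - w_max) * edge))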

edge, real

  • ✅ The square edges are preserved.
  • ✅ The sky blur is gone
  • ✅ The Eiffel Tower’s edges seem preserved.
  • ❌ Artifacts visible in the sky (top-right)
  • ❌ The foliage texture is lost.
  • ❌ The metallic structure lost precision.
  • ❌ The grass mowing pattern is completely lost.

Step 5 - ML-based noise detection

In the previous step, I tried to discriminate between areas to blur and areas to keep as-is. The issue is my discrimination criterion: edges. I was focusing on keeping edges, but lost good noise like the foliage.

So now I wonder, can I split good noise from bad noise using a classification model?

  foreach x, y in image
    window = extract_window_around(image, x, y, window_size)
    bad_noise_probability = run_model(window)
    blur_window_size = lerp(1, 10, bad_noise_probability)
    res = average_pixels(image, x, y, blur_window_size)
    image.set(x, y, res)

For this model, I tried to go with a naïve approach:

  • select a set of clean images
  • generate their noisy counterparts in an image editor
  • split these images into 16x16 pixel chunks.

model training set extraction

Those would represent my training & test sets (6000 and 600 items). The goal is now, from a 16x16 pixel window, to determine whether the pixel belongs to noise or to some detail.

Then, I iterate over my pixels, extract the 16x16 window around each, run the model on it, and use the resulting probability to select my blur window. My guess is that we should now be able to differentiate foliage from sky noise.
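Purely as an illustration (a sketch, not the exact model I trained), such a patch classifier can be as small as a tiny Keras CNN over 16x16 RGB chunks, outputting a bad-noise probability:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(16, 16, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(bad noise)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_patches, train_labels, validation_data=(test_patches, test_labels))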

Here is the model output: in red the parts to clean, in black the parts to keep.

model output

And here is the output:

final result

  • ✅ Edges are preserved.
  • ✅ Steel structure is clear in the middle.
  • ✅ Left foliage looks textured.
  • ❌ Right foliage shadows are still noisy.
  • ❌ Some areas of the steel structure are blurred.
  • ❌ Sky has artifacts.

The model training set is composed of only ~6000 chunks extracted from 4 images (2 clean, 2 noisy). Training the same model on a better dataset might be a first way to improve the noise classification.

This result seems better than the bilateral filtering, so I guess that’s enough for a first step into the ML world. I will stop there for now, and move on to the next project!



Some friends were registered for this CTF, and since I had some days off, I decided to work a bit on one RE exercise.

The binary is called BadVM:

[nathan@Jyn badvm]$ ./badvm-original
### BadVM 0.1 ###

Veuillez entrer le mot de passe:
toto
Ca mouline ...
Plus qu'un instant ... On avait la réponse depuis le début en faite :>
Perdu ...

(In English, the program output reads: “Please enter the password” / “Churning away…” / “Just a moment… We had the answer from the start, actually :>” / “You lose…”)

It is a stripped, ELF 64 PIE binary. Time to start Binary Ninja. This binary has no anti-debug nor packing techniques, just some calls to sleep. Once these calls are NOPed, we can start reversing the VM.

The VM is initialized in the function I called load_vm (0xde6). Then, the function at 0xd5f is called; let’s call it vm_trampoline.

This function chooses the next instruction to execute, loads its address in rax, and calls it. vm_trampoline is called at the end of each instruction; thus, each instruction is a new entry in the backtrace.

This means that, when returning from the first call to vm_trampoline, we can read the result and return it. This takes us back to load_vm, where the result is checked.

In case of an invalid character in the password, we have an early exit. The input is checked linearly, with no hash or anything, thus instruction counting works well.

Since I was on holidays, I decided to experiment a bit with lldb, and instrument this VM using its API.

Reversing the VM

This VM uses a 0x300-byte-long buffer to run. Some points of interest:

  • 0x4: register A (rip)
  • 0x5: register B
  • 0xFF: register C (result)
  • 0x2fc: register D
  • 0x2fe: register E (instruction mask?)

  • 0x32: password buffer (30 bytes)
  • 0x2b: data buffer (xor data, 30 bytes)
  • 0x200: data start (binary’s .data is copied in this area)

Instructions are encoded as follows:

opcode

To select the instruction, the VM contains a jump-table.

jump-table

Here one of the instructions (a ~GOTO):

instruction

Final note: each instruction/function has the following prototype:

prototype

Instrumenting using LLDB

This VM does not check its own code, thus we can freely use software breakpoints. The code is not rewritten, thus offsets are kept. This allows us to simply use LLDB’s Python API to instrument and analyse the VM’s behavior.

First step, create an lldb instance:

import lldb

def init():
    dbg = lldb.SBDebugger.Create()
    dbg.SetAsync(True)
    console = dbg.GetCommandInterpreter()

    error = lldb.SBError()
    target = dbg.CreateTarget('./badvm', None, None, True, error)
    # check error.Success() here

    info = lldb.SBLaunchInfo(None)
    process = target.Launch(info, error)
    print("[LLDB] process launched")
    return dbg, target, process

Now, we can register our breakpoints. Since vm_trampoline is called before each instruction, we only need this one:

    target.BreakpointCreateByAddress(p_offset + VM_LOAD_BRKP_OFFSET)

Now, we can run. To interact with the binary, we can use LLDB’s events: by registering a listener, we get notified each time the process stops, or when a breakpoint is hit.

listener = dbg.GetListener()
event = lldb.SBEvent()

while True:
    if not listener.WaitForEvent(1, event):
        continue

    if event.GetType() != EVENT_STATE_CHANGED:
        # handle_event(process, program_offset, vm_memory, event)
        continue

    regs = get_gprs(get_frame(process))
    if regs['rip'] - program_offset != address:
        print("break location: 0x{:x} (0x{:x})".format(
              regs['rip'] - program_offset, regs['rip']))

To read memory or registers, we can simply do it like this:

process.ReadUnsignedFromMemory(vm_memory + 0, 1, err)

process.selected_thread.frame[frame_number].registers
# registers[0] contains general purpose registers

Now we can implement a pretty-printer to get “readable” instructions. Once everything is put together, we can dump the execution trace:

mov [0x00], 0xff
mov [0x01], 0x01
mov tmp, [0x00]  	# tmp=0xff
mov [tmp], [0x01]	# src=0x1
mov [0x00], 0x0b
mov [0x01], 0x1d
mov tmp, [0x00]  	# tmp=0xb
mov [tmp], [0x01]	# src=0x1d
mov [0x01], 0x0b
mov tmp, [0x01]  	# tmp=0xb
mov [0x00], [tmp]	# [tmp]=0x1d
mov r5, [0x00]
sub r5, [0x0a]   	# 0x1d - 0x0 = 0x1d
if r5 == 0:
    mov rip, 0x2d
mov [0x01], 0x0a
[...]

Now, we can reverse the program running in the VM:

def validate(password, xor_data):
    if len(password) != len(xor_data):
        return -1

    D = 0
    for i in range(len(xor_data)):
        tmp = (D + 0xAC) % 0x2D
        D = tmp
        if xor_data[i] != chr(ord(password[i]) ^ tmp):
            return i

    return len(xor_data)
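Inverting that check gives the password (i.e. the flag) straight from the xor data:

def solve(xor_data):
    D, password = 0, []
    for c in xor_data:
        D = (D + 0xAC) % 0x2D
        password.append(chr(ord(c) ^ D))
    return "".join(password)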

And we get the flag:

SCE{1_4m_not_4n_is4_d3s1yn3r}

Conclusion

This VM has no anti-debug, packing, or anything special, but it was a fun binary to reverse. To instrument the VM, lldb is useful, but DynamoRIO would be a more elegant method.

