
May 9, 2022

Syncing text files between browser and disk using Yjs and the File System Access API

Michael Fester

@michaelfester

In this post, we look at the newly introduced File System Access API, which lets the browser read and write files on a computer. This enables apps to offer a browser-first experience, all while syncing their data with a folder on disk. We believe that the content you create should not be tied to the tool you use to create it. Software applications should, whenever possible, push for freeing their data, and make it available for other applications to use as well, without vendor lock-in.

This blog post, for instance, resides as a plain text file on my computer. It is written partly in Motif, partly in VS Code, and changes are immediately synced with the live webpage you are currently reading. This file, and hence this blog post, will stay with me for as long as I decide—it won't go away if one day I change tools.

By making data available as files with two-way sync, apps instantly become endowed with new, exciting features: data can be stored anywhere (GitHub, Dropbox, Google Drive, …) and version-controlled (e.g. Git); in the case of plain text data, it can be edited with your editor of choice (VS Code, Obsidian, …); content produced with other apps (a bunch of Markdown notes, a folder full of photos, …) can be easily and continuously integrated, no tedious “upload” or “import” steps required.

The point is: by freeing the data from the tool, apps actually become more powerful, because they are now part of a collective of tools working hand-in-hand, augmenting one another with what they do best (like the POSIX utilities). Software developers know this. Why is this design paradigm not more broadly adopted, for the benefit of everyone?

Two-way sync between the browser and the local file system is now possible thanks to the File System Access API.

With the emergence of browser-first apps, such as Figma and Google Docs, a new era of collaboration has seen the light. Everyone can just open up a browser, enter a URL, and start contributing. The barrier to entry has been lowered, and many more people can participate in the work and play. Browser-first apps offer a level of convenience that is unmatched in their standalone desktop counterparts. However, with cloud-based apps, data is typically stored in the cloud and controlled by the service provider, often in opaque and proprietary formats optimized for the tool itself (because why not? only the tool needs to understand the format!). Unless an API exposes the data, it is basically locked within the tool. And if the API is turned off, if the service stops working, or if the user moves on to another tool, the data is stuck in limbo.

With the File System Access API, we see an opportunity to come full circle: we can offer the convenience, availability, and collaborative aspects of a browser-first app, all while keeping the data free from the tool, unlocking its full potential. This is in line with the vision presented by Geoffrey Litt in his article Bring your own client. Let’s see how this can be achieved in practice for text files!

Towards hybrid browser-cloud-file data architectures

Here is a typical hybrid browser-cloud-file architecture (this is for instance the one Motif is based on):

In Motif, four data sources need to be merged: live clients via WebRTC, cloud storage, local storage, and now, the file system.

Specifically:

  • The client application can be accessed in a browser, no installation required. Like any web application, this lowers the barrier to entry to get started (users just need to open a browser and enter the URL of the app), and ensures broad availability across platforms and devices.
  • Continuing in the web app paradigm, the client synchronizes its data with a cloud database, making data accessible across clients, and enabling asynchronous collaboration and synchronization.
  • To make this data available instantly and while offline, browser storage, such as IndexedDB, can be used. In Motif, we use Replicache to sync local and remote states.
  • When two clients open the same document at the same time, they instantiate a WebRTC connection and exchange data in real-time.
  • Finally, if file sync is enabled, all content in the client is synced to a folder on the file system. Changes in the file system get seamlessly merged with the client, and vice-versa.

The challenge in such a setup is to make sure that no data gets lost, and that all clients eventually reach the same state. For instance, when two users edit the same part of a document, either in real time (data transits via WebRTC), or asynchronously (data transits via a cloud database), we expect that these changes get merged seamlessly.

In order to address this, we are using a Conflict-free Replicated Data Type (CRDT) as the underlying data model. More specifically, in this blog post, we are using the Yjs implementation, which is optimized for large documents, and in general, has great performance characteristics (another notable option is Automerge). But while the client apps, the cloud storage, and the local IndexedDB storage can hold the Yjs documents as CRDTs in their entirety, which is required for merging changes, the files on disk only hold their textual representations. With files, we lose the CRDT info, and thereby, the built-in merge capabilities.

Of course, we are not satisfied with a solution that simply overwrites changes. Not only do we risk losing data, e.g. if a change comes in from the file system and from a remote client at the same time, but it would also push us to add an extra layer of complexity in order to figure out what version comes in last (timestamps are hard to deal with in a distributed setup).

The solution that we present here turns out to be fairly straightforward, by simply adopting a “CRDT mindset” when thinking about the problem. If we can somehow manage to make the disk files behave as if they were CRDTs as well, we can then treat the file system as just another data source in our CRDT architecture, providing nothing but delta updates. As illustrated below, this can be achieved by keeping a version of the “last-write-to-disk” in a persistent cache, and computing the diff with the disk version as it comes in.

How it works

Let’s walk through the implementation. The full source is available on GitHub. You can skip to the sync part if you are already familiar with the File System Access API.

Preparing for write access

First, let’s make the browser client ready for reading and writing files. We need to check that the File System Access API is supported:

const isSupported = () => {
  return typeof window.showDirectoryPicker === 'function'
}

As of May 2022, the File System Access API is supported on Chrome 86+, Edge 86+ and Opera 72+.

Next, let’s set the root directory in which to sync our files, and grant read and write permissions:

const readWriteOptions = { mode: 'readwrite' }

const setRootDirectory = async () => {
  const handle: any = await (window as any).showDirectoryPicker()
  if (!handle) {
    return undefined
  }
  let granted = (await handle.queryPermission(readWriteOptions)) === 'granted'
  if (!granted) {
    granted = (await handle.requestPermission(readWriteOptions)) === 'granted'
  }
  return { handle, granted }
}

This function will trigger a browser dialog:

If permissions are successfully granted, we now have a directory handle that we can use to read from and write to the file system. For the sake of simplicity, this blog post will only deal with syncing a single file. Handling multiple files, subfolders, renaming and moving adds an extra layer of complexity that is out of the scope of this post. Our repo contains helper functions though, which you can use to deal with these situations.
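One detail worth handling around the picker: when the user dismisses the dialog, `showDirectoryPicker()` rejects with an `AbortError` instead of resolving to an empty value. The following is a minimal sketch of one way to map that case to `undefined`. The name `pickOrUndefined` is hypothetical, and the picker is passed in as a parameter so that the logic does not depend on browser globals.

```typescript
// Hypothetical helper (not part of the original code): runs a picker
// function and maps a user cancellation to `undefined`.
const pickOrUndefined = async <T>(
  picker: () => Promise<T>
): Promise<T | undefined> => {
  try {
    return await picker()
  } catch (error) {
    // Dismissing the dialog rejects with a DOMException named "AbortError".
    if ((error as { name?: string } | null)?.name === 'AbortError') {
      return undefined
    }
    // Anything else (e.g. a security error) is unexpected, so rethrow.
    throw error
  }
}
```

In the browser, this could be called as `pickOrUndefined(() => (window as any).showDirectoryPicker())`.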

Let’s create a function for obtaining the handle to the file we want to sync our content with. As mentioned, we are only dealing with the case of a fixed file at the base of our root folder.

const getFileHandle = async (
    name: string,
    directoryHandle: FileSystemDirectoryHandle
): Promise<FileSystemFileHandle | undefined> => {
  for await (const handle of (directoryHandle as any).values()) {
    const relativePath = (await directoryHandle.resolve(handle)) || []
    if (relativePath?.length === 1 && relativePath[0] === name) {
      return handle
    }
  }
  return undefined
}

If the file is not present in the directory, let’s create it, using the following function:

const createFileHandle = async (
    name: string,
    directoryHandle: FileSystemDirectoryHandle
): Promise<FileSystemFileHandle | undefined> => {
  return await directoryHandle.getFileHandle(name, { create: true })
}
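For a file sitting directly in the root folder, the lookup can also be done without iterating over the directory: `getFileHandle()` on a directory handle rejects with a `NotFoundError` when the entry is missing and `create` is not set. Here is a hedged sketch of that alternative. The name `getFileHandleIfExists` and the `DirectoryLike` type are hypothetical, introduced only for illustration.

```typescript
// Minimal structural type so the sketch does not depend on DOM typings.
type DirectoryLike<F> = {
  getFileHandle: (name: string, options?: { create?: boolean }) => Promise<F>
}

// Hypothetical alternative to the iteration-based lookup: ask for the
// entry directly and treat "NotFoundError" as "no such file".
const getFileHandleIfExists = async <F>(
  name: string,
  directoryHandle: DirectoryLike<F>
): Promise<F | undefined> => {
  try {
    return await directoryHandle.getFileHandle(name)
  } catch (error) {
    if ((error as { name?: string } | null)?.name === 'NotFoundError') {
      return undefined
    }
    // Permission or type errors are not "missing file" cases, so rethrow.
    throw error
  }
}
```

The iteration-based version remains useful as a starting point when matching entries in subfolders, where the relative path has more than one segment.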

Finally, let’s create a function for writing content to our file:

const writeContentToFile = async (
  fileHandle: FileSystemFileHandle,
  content: string
) => {
  const writable = await (fileHandle as any).createWritable()
  await writable.write(content)
  await writable.close()
}

Syncing between data sources

Now that we can write to our file, we can set up the sync mechanism. The challenge lies in seamlessly merging different versions of the content, as we don’t just want to overwrite one version with another. Indeed, changes can come asynchronously from the cloud, another client over WebRTC, or from the file system. With Yjs, we have a framework for doing this, provided that we can work with the full CRDT. In the file system, however, we only have a raw content file, not the entire CRDT. Let’s see how we can mitigate this.

First, let’s create a Yjs document, which acts as the authoritative source for our content in the client, and let’s insert some content programmatically.

import * as Y from 'yjs'

const doc = new Y.Doc()
doc.getText().insert(0, 'Hello world')

In practice, the editing of a Yjs document is done using an editor binding such as y-prosemirror or y-monaco. For an example using Monaco, check out our sample Next.js application on GitHub.

Next, let’s get a hold of the file on disk. We’ll use the name test-file.txt for this demo. If it doesn’t exist, we create it:

const name = "test-file.txt"

// We assume `directoryHandle` has been set, e.g. using the 
// `setRootDirectory` method defined above.
let fileHandle = await getFileHandle(name, directoryHandle)

if (!fileHandle) {
  // File is not present in the file system, so create it.
  fileHandle = await createFileHandle(name, directoryHandle)
}

Now, the trick for being able to leverage the CRDT features of our Yjs document is to write its content to a local persistent cache, in addition to writing it to disk. In that way, subsequent changes to the file system can be compared with the cached version. Here is our function for writing the file to disk, alongside the cache. We use IndexedDB for the cache, using the handy IDB-Keyval library.

import { set } from 'idb-keyval'

const updateFileContent = async (
  file: globalThis.File,
  fileHandle: FileSystemFileHandle,
  content: string
) => {
  // When we write to the file system, we also save a version
  // in cache in order to detect subsequent changes to the file.
  await writeContentToFile(fileHandle, content)
  // Writing updates the modification time, so re-read the file to
  // cache the timestamp of the version we just wrote.
  const writtenFile = await fileHandle.getFile()
  await set(
    file.name,
    JSON.stringify({
      name: file.name,
      content,
      lastModified: writtenFile.lastModified
    })
  )
}

We update our previous code to include saving the content of our Yjs document to our newly created file:

const name = "test-file.txt"

// We assume `directoryHandle` has been set, e.g. using the 
// `setRootDirectory` method defined above.
let fileHandle = await getFileHandle(name, directoryHandle)

// Get the textual representation of the Yjs document
const content = doc.getText().toString()

if (!fileHandle) {
  // File is not present in the file system, so create it.
  fileHandle = await createFileHandle(name, directoryHandle)
  // Get the full file from disk.
  const newFile = await fileHandle.getFile()
  // Write the content of our document to disk for the first time,
  // and keep track of it in the cache.
  await updateFileContent(newFile, fileHandle, content)
  return
}

Now, after the initial file creation, and for any subsequent file write operations, we keep track of what is being written to disk. The next step is to handle the case where the file already exists. If the last modified timestamps between the file and the cache are equal, we can safely assume that no change has occurred on disk, and the only change would come from the Yjs document, which we can write to disk as-is:

import { get } from 'idb-keyval'

// ...

const content = doc.getText().toString()

if (!fileHandle) {
  // ...
  return
}

// File exists on disk, so compare it with the last-write-cache.
// For this example, we assume that the cache exists.
const lastWriteCacheData = JSON.parse(await get(name))
const file = await fileHandle.getFile()

if (file.lastModified === lastWriteCacheData.lastModified) {
  // File has not changed on disk, so we can safely overwrite it with
  // the Yjs document content.
  await updateFileContent(file, fileHandle, content)
  return
}

And now to the crux of the implementation: a change has occurred to the file on disk. This is the case if file.lastModified differs from lastWriteCacheData.lastModified. We can determine exactly what has changed by comparing the two versions. A fortiori, since any write to disk is tracked in the cache, the file version is more recent than the cache version. So, we want to compute the diff between the two, and build a set of Yjs delta operations that will take us from the cached version to the newer disk version. This can be readily achieved using a diffing library such as jsdiff.

import * as Diff from 'diff'

type YDelta = { retain: number } | { delete: number } | { insert: string }

export const getDeltaOperations = (
  initialText: string,
  finalText: string
): YDelta[] => {
  if (initialText === finalText) {
    return []
  }

  const edits = Diff.diffChars(initialText, finalText)
  // Number of unchanged characters seen since the last emitted operation.
  let prevOffset = 0
  const deltas: YDelta[] = []

  // Map the edits onto Yjs delta operations
  for (const edit of edits) {
    if (!edit.value) {
      continue
    }
    if (edit.removed || edit.added) {
      // Skip over the unchanged text preceding this edit.
      if (prevOffset > 0) {
        deltas.push({ retain: prevOffset })
      }
      deltas.push(
        edit.removed
          ? { delete: edit.value.length }
          : { insert: edit.value }
      )
      // An insert does not consume any of the original text, so the
      // offset restarts at zero in both cases.
      prevOffset = 0
    } else {
      prevOffset = edit.value.length
    }
  }
  return deltas
}
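To make the delta format more concrete, here is a small sketch that applies such a list of operations to a plain string. `applyDeltaToString` is a hypothetical helper for illustration only; in the sync code, the deltas are applied to the `Y.Text` value instead. It assumes the usual delta convention: `retain` and `delete` advance over the original text, while `insert` adds new text without consuming any of the original.

```typescript
// Same shape as the YDelta type used for the diff.
type YDelta = { retain: number } | { delete: number } | { insert: string }

// Hypothetical helper: replays delta operations on a plain string.
const applyDeltaToString = (text: string, deltas: YDelta[]): string => {
  let cursor = 0 // position in the original text
  let result = ''
  for (const op of deltas) {
    if ('retain' in op) {
      // Keep the next `retain` characters as they are.
      result += text.slice(cursor, cursor + op.retain)
      cursor += op.retain
    } else if ('delete' in op) {
      // Skip the next `delete` characters of the original.
      cursor += op.delete
    } else {
      // Add new text at the current position.
      result += op.insert
    }
  }
  // Whatever follows the last operation is kept unchanged.
  return result + text.slice(cursor)
}

// Turning "Hello world" into "Hello, brave world":
const example: YDelta[] = [{ retain: 5 }, { insert: ', brave' }]
applyDeltaToString('Hello world', example) // "Hello, brave world"
```

Note that no trailing `retain` is needed: text after the last operation is left untouched.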

We are now ready to merge the disk version of the file with the Yjs document (which may also have changed!). Once we have the delta operations, we can apply them using the applyDelta() function on the Y.Text value of the document.

import { get } from 'idb-keyval'

// ...

const content = doc.getText().toString()

if (!fileHandle) {
  // ...
  return
}

const lastWriteCacheData = JSON.parse(await get(name))
const file = await fileHandle.getFile()

if (file.lastModified === lastWriteCacheData.lastModified) {
  // ...
  return
}

// File has changed in the file system.

const fileContent = await file.text()
const lastWriteFileContent = lastWriteCacheData.content
const deltas = getDeltaOperations(lastWriteFileContent, fileContent)

if (deltas.length === 0) {
  // No difference between disk and last-write-cache versions, so
  // just write the Yjs document to disk.
  await updateFileContent(file, fileHandle, content)
  return
}

// A change has happened in the file, since it differs from the
// cached version. So we merge it with the Yjs document.
doc.getText().applyDelta(deltas)

const mergedContent = doc.getText().toString()
await updateFileContent(file, fileHandle, mergedContent)

That’s it! Thanks to the idempotent and commutative nature of the CRDT, we don’t need to worry about timestamps, or whether the Yjs document has changed (e.g. from a remote client) at the same time as the disk version. We can now simply start polling the file system at regular intervals, and Yjs will do the magic!
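The polling step could look roughly like the following. This is a minimal sketch under a few assumptions: `syncOnce` stands in for the merge logic above wrapped in an async function, and `startPolling` and its `intervalMs` parameter are hypothetical names. The guard makes sure that a slow sync run does not overlap with the next tick.

```typescript
// Hypothetical helper: calls `syncOnce` every `intervalMs` milliseconds,
// skipping a tick if the previous run has not finished yet.
const startPolling = (
  syncOnce: () => Promise<void>,
  intervalMs: number
): (() => void) => {
  let running = false
  const timer = setInterval(async () => {
    if (running) {
      return
    }
    running = true
    try {
      await syncOnce()
    } catch (error) {
      // A failed run (e.g. revoked permissions) should not stop the loop.
      console.error('Sync failed', error)
    } finally {
      running = false
    }
  }, intervalMs)
  // The returned function stops the polling.
  return () => clearInterval(timer)
}
```

Calling the returned function, for instance when the editor is closed, stops the loop.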

Demo

You can try out a demo running the Monaco editor in the browser and performing live sync on a single file here:

And here is a demo of how it looks in Motif, including full directory sync:

Current caveats

There are a few minor inconveniences which will hopefully be addressed as the File System Access API matures and becomes more broadly available.

  • Only the latest versions of Chrome, Edge and Opera support it. Safari is starting to add support, but currently, it only allows writing to the origin private file system, and not to an actual folder that we can access outside of the browser.
  • The operations for renaming and moving a file or a folder are only supported when writing to the origin private file system. This makes the implementation tedious, especially for subfolders, since we need to manually create, copy and delete items. Copying, rather than moving, also poses the risk of running out of disk space. We don’t worry about this in Motif, since we’re dealing with text files, so a priori small payloads.
  • When the browser app is closed, the read and write permissions are revoked. This means that the next time the user opens the app and wants to perform file synchronization, they need to grant the permissions again. To make this process as painless as possible in Motif, we show a discreet notification at the bottom of the editor, asking to resume syncing. We have found this solution to work quite well in practice.
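The manual create, copy and delete steps mentioned in the second caveat can be sketched as follows for a single text file. `renameTextFile` and the `*Like` types are hypothetical names introduced for illustration. The handle methods themselves (`getFileHandle`, `getFile`, `createWritable`, `removeEntry`) are part of the File System Access API.

```typescript
// Minimal structural types, so the sketch does not rely on DOM typings.
type WritableLike = {
  write: (content: string) => Promise<void>
  close: () => Promise<void>
}
type FileHandleLike = {
  getFile: () => Promise<{ text: () => Promise<string> }>
  createWritable: () => Promise<WritableLike>
}
type DirectoryHandleLike = {
  getFileHandle: (
    name: string,
    options?: { create?: boolean }
  ) => Promise<FileHandleLike>
  removeEntry: (name: string) => Promise<void>
}

// Hypothetical helper: "renames" a text file inside one directory by
// copying its content to a new entry and then deleting the old one.
const renameTextFile = async (
  directoryHandle: DirectoryHandleLike,
  oldName: string,
  newName: string
): Promise<FileHandleLike> => {
  const source = await directoryHandle.getFileHandle(oldName)
  const content = await (await source.getFile()).text()

  const target = await directoryHandle.getFileHandle(newName, { create: true })
  const writable = await target.createWritable()
  await writable.write(content)
  await writable.close()

  // Remove the original only after the copy has been fully written.
  await directoryHandle.removeEntry(oldName)
  return target
}
```

Deleting the original last means that an interrupted rename leaves a duplicate rather than a missing file, at the cost of temporarily using twice the disk space.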

Despite these shortcomings, all in all, the sync feature works great for what it is mainly intended for: easily storing browser data in the file system, editing a file with another editor, and bulk importing files into the browser app by simply dropping them into a folder.

Next steps

For a fully functional browser-based IDE like Motif, we want to sync many files and folders, not just a single file. Although in this post we solely focus on the latter scenario, the former is conceptually the same. In addition to syncing, we also want to be able to handle:

  • Moving, renaming and deleting files and folders. In our YFS repo, you will find most of the methods required to fulfill this.
  • Background sync, so that all files are always up to date. For instance, if a user updates a global CSS file, another live user should immediately see the effect on the page they are working on. You would probably want to use a longer interval for syncing files other than the one that is currently open.
  • Opening files from outside of the app, by leveraging the Web Share Target API. In Motif, files can have dependencies on other files (e.g. an MDX file importing a JSX component defined in another file), so we want to keep track of the project that the file is in. In order to do that, we would need more context about the file, e.g. a root parent handle. We haven’t investigated much in that direction yet.

Source code

The source code for this library, and an example implementation using the Monaco editor, is available on GitHub.

Acknowledgements

Thanks to Kevin Jahns, Geoffrey Litt, Titus Wormer, Ellen Chisa, Nick Rutten, Remco Haszing, Chet Corcos, Paco Coursey, Chris Smothers, Ankur Goyal and Jeff Tang for their insights and valuable feedback.

Interested in working on similar problems?

If you like working on CRDTs, text editors, JavaScript build systems, DM me (@michaelfester) on Twitter! If you are an Open Source contributor in these areas, we’d like to know, and we could fund you.


To keep up to date with our progress, make sure to follow @motifland on Twitter.

  • © 2022 Motif Land Inc.
  • This site is built with Motif.