Worksheets Technical specs¶

Scope:

This document lays out challenges and potential solutions to worksheets from a technical perspective.
This document does not concern itself with any UX issues.

Status:

This document is in early draft state. No decisions have been made yet!

Book List Example¶

We use the following example in a number of places throughout the document:

Query:

This is the SQL that is generated by the query section of the worksheet. Exactly how we got this SQL doesn’t really matter here. It could have come from a UI-based query builder or could have been typed out manually by the user.

with
  genres as (
    select book_id, array_agg(genres.id) as genres
    from books_genres
    join genres on books_genres.genre_id = genres.id
    group by book_id
  ),
  copies as (
    select book_id, count(*) as copies
    from items
    group by book_id
  )
select
  books.id as book_id,
  books.title,
  books.author_id as author,
  authors.url as author_url,
  genres.genres,
  coalesce(copies.copies, 0) as copies
from books
left join authors on authors.id = book.author_id
left join genres on books.id = genres.book_id
left join copies on books.id = copies.book_id;

Side note

In Querydown this would be:
#books
$ id -> book_id
$ title
$ author_id -> author
$ #genres.id%list -> genres
$ #items -> copies
With a the worksheets having a modular approach to queries and displays, my hope is that a Mathesar extension could eventually provide a Querydown-based query editor. This would be a really fast way to build powerful worksheets!

Raw query results:

This is the raw data returned by Postgres.

book_id title author author_url genres copies

921 'Foo' 5 NULL {101,102} 10

32 'Bar' 6 'http://example.com' {} 0

1007 'Apple' 6 'http://example.com' {103} 2
Sheet display:

Here’s (roughly) how the sheet would appear to the user.

book_id title author author_url genres copies

921 Foo Alice Davis NULL Adventure Fantasy 10

32 Bar Bob Scott http://example.com 0

1007 Apple Bob Scott http://example.com Non-fiction 2

OIDs vs names¶

🤔 Challenges:

Do worksheets reference PostgreSQL tables and columns using names or OIDs?

If we use OIDs, how do we allow the user to edit the SQL using names?

TODO

Query safeguards¶

🤔 Challenges:

Do we want to prevent users from manually executing DML and DDL in their queries?

If so, how?

Are temporary tables okay?

TODO

Record summaries¶

🤔 Challenges:

Where is the record summary template stored?

How does the worksheet fetch the record summary data? (We need to handle direct FK columns as well as array aggregates.)

Display settings¶

Two types of columns can be configured to display record summaries:
- Scalar FKs: columns with source data having a single-column FK.
- Aggregate PKs: array aggregates of single-column PK values.
The record summary can be turned off in the display settings too. When it’s off, the raw PK values will display.
By default: record summaries are enabled wherever possible.

In the Book List Example, this is how we get record summaries for author and genres by default, without getting record summaries for id.
Limitations: Record summaries cannot be configured for
- Multi-column FKs (because they point to more than one PK column).
- FK columns that point to non-PK columns.

Record summary templates¶

The record summary template is stored in the worksheet’s display settings, and is associated with a column in the worksheet.
ⓘ Note:

Before worksheets, Mathesar associated record summaries with each table so that all FKs which referenced the same table would automatically use the same record summary. With worksheets, it is not (yet?) possible to configure a default per-table record summary template to always be used for references to that table. This behavior simplifies some things, but also has the following consequences:
- In some cases, there might be some more tedium associated with creating new worksheets because you can’t easily re-use a record summary template that you created elsewhere.
- The Record Page can’t show a record summary for the record.
I think these are acceptable tradeoffs though. The benefit of the worksheet approach is that different worksheets can have different record summary templates to refer to the same table.
Each record summary template has one and only one “source table” — the table which contains the PK column on which the record summary is operating.
The sheet can display record summaries even when no template is explicitly configured by the user. In this case it will be smart enough to auto-generate a template that works well in most cases.
- The auto-generated template is not persisted — it is computed on the fly. This allows us to update the smart logic used to auto-generate these templates.
- The smart logic works as follows:
  
  TODO
In the Book List Example:
- By default the “author” column shows a record summary and uses smart logic to formulate the following template on the fly:
```
{first_name}
```
  (The source table for this record summary is authors, so column references pertain to columns in that table.)
  
  The worksheet is then customized to store this improved template, which is persisted within the worksheet:
```
{first_name} {last_name}
```
- The “genres” column also shows record summaries by default. With no template being explicitly configured, the sheet is smart enough to use the following template:
```
{genre.name}
```
  The source table for this record summary is books_genres which does not have a name column. But the template is able to use dot notation to refer to a field in the directly related genres table.
The precise syntax of the record summary template still needs a bit more specification with regard to quoting and escaping.

Fetching record summary data¶

Here’s how we fetch the record summary data for the Book List Example.

When the search query runs, the service layer stores the results in a temporary table.
Before closing the transaction, the service layer issues additional queries using this temporary table. For the sake of this example, we’ll call this temporary table results.
The additional queries look like this:
- This query fetches the data necessary to render all record summaries within the “authors” column.
```
select
  authors.id,
  authors.first_name,
  authors.last_name
from results
join authors on authors.id = results.author
group by authors.id
```
  The first_name and last_name columns are included because they are present in the record summary template.
- This query fetches the data necessary to render all the record summaries within the “genres” column
```
with ids as (select distinct unnest(genres) as ids from results)
select
  books_genres.id,
  genres.name
from ids
join books_genres on books_genres.id = ids.id
left join genres on genres.id = books_genres.genre_id
```
  The genres.name column is included because it’s present in the record summary template, and we left-join the genres table into the query to access that data.

DML¶

🤔 Challenges:

How do we do DML from within a worksheet?

Metadata¶

Most of the time, Mathesar is able to use static analysis of SQL to determine metadata about the columns in a query’s result set. This metadata gives us a clear picture of where the columns came from and what their types are. For some queries, Mathesar won’t be smart enough to trace the origin of every column.

In the Book List Example, we’d get this metadata:

{
  "book_id": {
    "classification": "pk_cell",
    "table": "books",
    "column": "id"
  },
  "title": {
    "classification": "data_cell",
    "table": "books",
    "column": "title",
    "pk_lookup_reference": [ "book_id" ]
  },
  "author": {
    "classification": "data_cell",
    "table": "books",
    "column": "author_id",
    "pk_lookup_reference": [ "book_id" ]
  },
  "author_url": {
    "classification": "data_cell",
    "table": "author",
    "column": "url",
    "pk_lookup_reference": [ "author" ]
  },
  "genres": {
    "classification": "pk_array",
    "table": "books_genres",
    "conditions": [
      {
        "column": "book_id",
        "equals": {
          "query_column": "book_id"
        }
      }
    ]
  },
  "copies": {
    "classification": "aggregate_count",
    "table": "items",
    "conditions": [
      {
        "column": "book_id",
        "equals": {
          "query_column": "book_id"
        }
      }
    ]
  }
}

The column metadata gives the worksheet enough information to formulate INSERT, UPDATE, and DELETE queries in most cases.
The primary key column(s) from the origin table must be present in the result set for the worksheet to be able to perform cell updates.

Update examples¶

In the Book List Example, we want to make some changes to the first row…

In the title column, we can change “Foo” to “FOO”. This is an UPDATE query on the authors table.
In the author and author_url, we can make the same kinds of edits.
In the genres column, we can remove “Fantasy”. This is a DELETE query on the books_genres table.
In the genres column, we can add “Sci-fi”. This is a INSERT query on the books_genres table. It uses the genres PK value from the “Sci-fi” genre the user selected via the record selector.
In the genres column, we can change “Non-fiction” to “Nonfiction”. This is an UPDATE query on the genres table.

There’s still a lot of UX to figure out here. Here’s one UX idea:
- The user will be given the option to edit any fields referenced within the record summary. In this case the record summary only references one field. But theoretically it could reference multiple. And the record summary could also include static text in the template too. When the user edits, they don’t edit the text exactly as it’s shown in the record summary — they edit the text from a specific field, one field at a time. So then the worksheet gets a new value for genres.name. The worksheet submits a request to update genres.name. The worksheet doesn’t know the genres PK value, but it does know the books_genres PK value and that should be enough to tell the backend to update a specific genres record, changing the name value.
From the copies column, we might later implement some UI for editing too! Ideas:
- Hyperlink to a new worksheet (to be opened in a separate tab) which shows all those counted items (with the right filters applied).
- Do something similar in a modal
These approaches would allow users to quickly add/delete records that would in turn affect the total count.

Reactivity¶

🤔 Challenges:

How do dependent cells refresh when DML changes are made?

Short term solution¶

Worksheets must be refreshed manually in order to refresh dependent cells.

Long term solution¶

We use the column metadata to build a primitive reactivity graph that allows the front end to update dependent UI without full refresh in most cases.

Pagination¶

🤔 Challenges:

How does pagination work if the SQL query defines its own LIMIT/OFFSET?

How do we apply LIMIT/OFFSET to user-written queries?

How do we communicate that LIMIT will be automatically applied to user-written queries?

How do we show the total number of records in the sheet?

Display-centric pagination¶

Control over pagination is delegated to the display portion of the worksheet — not the query portion. Each display type must implement its own pagination controls as it sees fit.
User-written SQL queries are not permitted to contain LIMIT or OFFSET clauses. If either of these clauses is present at the top level of the query, Mathesar will display an error.
The worksheet combines the SQL from the query section with the appropriate LIMIT and OFFSET as specified from the pagination controls.
We don’t worry about communicating this auto-applied LIMIT/OFFSET to users up front. If they try to add their own, LIMIT and OFFSET clauses, then they’ll see an appropriate error message and learn how it works at that point.

Record totals¶

A supplemental query is run by pipelining the user-written worksheet query into count(*). That count data is passed to the front end in the same API response as the query results.

Outline view¶

🤔 Challenges:

How do we show the total number of items in an outline group without necessarily having all the items present in the result set (due to pagination)?

Example¶

Query

select
  books.id as book_id,
  books.title,
  books.author_id
from books;

Sheet display

author Alice Davis (1)

book_id title author

1 921 Foo Alice Davis

author Bob Scott (2)

book_id title author

2 32 Bar Bob Scott

3 1007 Apple Bob Scott

Nested pagination¶

Each outline group has its own pagination control.
The service layer issues one query per outline group with its own limit and offset.
- These queries can still be batched at the API layer. This way if the user refreshes the worksheet, only one API request is needed to reload all outline groups.
- These queries could potentially even be batched to make one TCP request to the Postgres server as well. Since they’ll all have the same columns, they could be combined with UNION ALL and then separated out afterwards.
All outline groups are collapsed by default. Some UI exists to collapse and expand all groups together.

When the worksheet first loads, only the group headers are fetched and shown. To fetch all those group headers and counts, the worksheet query, as-written, is transformed so as to produce the outline group data. First any specified limit and offset are removed. Then the query is pipelined via CTE to give the group header data.

In the example above, the raw outline group data is fetched via:

with q as (              -- ╮
  select                 -- │
    books.id as book_id, -- │
    books.title,         -- ├── Worksheet query here,
    books.author_id      -- │   without LIMIT/OFFSET
  from books             -- │
)                        -- ╯
select
  author_id,
  count(*) as num
from q
group by author_id;

For super large data sets, the groups themselves are also paginated — hence the “nested” pagination. There are multiple levels of pagination.
Per-group pagination controls are placed in the group header with a condensed UI.
Sheet-wide pagination of outline group headers is placed in the lower status bar and indicates that it operates on outline groups instead of data rows.

Default worksheets¶

🤔 Challenges:

How do we allow default worksheets to be set for tables without messing things up? For example, we want to avoid a default which has hidden columns.

Behavior¶

A table’s “Standard Worksheet” is the worksheet that Mathesar auto-generates for that table. It cannot be modified by the user.
A table’s “Default Worksheet” is the worksheet that opens when a user clicks to browse data in a table. It can be modified
A worksheet is eligible to be set as the default only if it has the following deviations from the standard worksheet:
- Per-column display options, including column width
- Column re-ordering
No other modifications are permitted to be done on the default worksheet. The rationale for this rule is that we want the default worksheet to provide an accurate picture of the table’s underlying structure within PostgreSQL.
Functionality is also available to revert a default worksheet to the standard worksheet.

Column control¶

🤔 Challenges:

Should column reordering modify query or modify sheet display options?

How are new columns added (or not added) to existing worksheets after being added to a table?

Can a PK column be added to the query (to support DML) but hidden from the sheet (to simplify the view)?

For more complex queries, is there a way we can auto-generate column aliases in order to have sensible defaults?

TODO

book_id	title	author	author_url	genres	copies
`921`	`'Foo'`	`5`	`NULL`	`{101,102}`	`10`
`32`	`'Bar'`	`6`	`'http://example.com'`	`{}`	`0`
`1007`	`'Apple'`	`6`	`'http://example.com'`	`{103}`	`2`

book_id	title	author	author_url	genres	copies
921	Foo	`Alice Davis`	`NULL`	`Adventure` `Fantasy`	10
32	Bar	`Bob Scott`	http://example.com		0
1007	Apple	`Bob Scott`	http://example.com	`Non-fiction`	2