Skip to content

Commit

Permalink
describe Bar-Hillel construction
Browse files Browse the repository at this point in the history
  • Loading branch information
breandan committed Oct 3, 2023
1 parent 8b14737 commit 34804c6
Show file tree
Hide file tree
Showing 7 changed files with 280 additions and 102 deletions.
Binary file modified latex/popl2024/popl.pdf
Binary file not shown.
10 changes: 5 additions & 5 deletions latex/popl2024/popl.tex
Original file line number Diff line number Diff line change
Expand Up @@ -157,19 +157,19 @@ \section{Problem statement}\label{sec:problem}
In either case, it follows $\Delta(\err{\sigma}, \ell) \leq \max(m, n)$ and $\ell$ is always reachable via a finite nonempty set of Levenshtein edits, i.e., $0 < \Delta(\err{\sigma}, \ell) < \infty$.
\end{proof}

Let us now consider the maximum growth rate of the \textit{admissible set}, $A \coloneq \Delta_q(\err{\sigma}) \cap \ell$, as a function of $q$ and $n$. Let $\bar\ell \coloneq \{\err{\sigma}\}$. Since $\bar\ell$ is finite and thus regular, $\ell = \Sigma^* \setminus \{\err{\sigma}\}$ is regular by the closure of regular languages under complementation, and thus context-free a fortiori. Since $\ell$ accepts every string except $\err{\sigma}$, it represents the worst CFL in terms of asymptotic growth of $A$.
Let us now consider the maximum growth rate of the \textit{admissible set}, $A \coloneqq \Delta_q(\err{\sigma}) \cap \ell$, as a function of $q$ and $n$. Let $\bar\ell \coloneqq \{\err{\sigma}\}$. Since $\bar\ell$ is finite and thus regular, $\ell = \Sigma^* \setminus \{\err{\sigma}\}$ is regular by the closure of regular languages under complementation, and thus context-free a fortiori. Since $\ell$ accepts every string except $\err{\sigma}$, it represents the worst CFL in terms of asymptotic growth of $A$.

\begin{lemma}\label{lemma:interleaving}
The complexity of searching $A$ is upper bounded by $\mathcal{O}\left(\sum_{c=1}^q{{cn + n + 1} \choose c}(|\Sigma| + 1)^c\right)$.
The complexity of searching $A$ is upper bounded by $\mathcal{O}\left(\sum_{c=1}^q{{cn + n + c} \choose c}(|\Sigma| + 1)^c\right)$.
\end{lemma}

\begin{proof}
We can overestimate the size of $A$ by considering the number of unique ways to insert, delete, or substitute $c$ terminals into a string $\err{\sigma}$ of length $n$. This can be overaproximated by interleaving $\varepsilon^c$ around every token, i.e., $\err{\sigma}_\varepsilon\coloneq \left(\varepsilon^c\err{\sigma}_i\right)_{i=1}^n\varepsilon^c$, where $|\err{\sigma}_\varepsilon| = cn + n + 1$, and only considering substitution. We augment $\Sigma_\varepsilon \coloneq \Sigma \cup \{\varepsilon\}$ so that deletions and insertions may be treated as special cases of substitution. Thus, we have $cn + n + 1$ positions to substitute $(|\Sigma_\varepsilon|)$ tokens, i.e., ${{cn + n + 1} \choose c}|\Sigma_\varepsilon|^c$ ways to edit $\err{\sigma}_\varepsilon$ for each $c \in [1, q]$. This upper bound is not tight, as overcounts many identical edits w.r.t. $\err{\sigma}$. Nonetheless, it is sufficient to show $|A| < \sum_{c=1}^q{{cn + n + 1} \choose c}|\Sigma_\varepsilon|^c$.
We can overestimate the size of $A$ by considering the number of unique ways to insert, delete, or substitute $c$ terminals into a string $\err{\sigma}$ of length $n$. This can be overaproximated by interleaving $\varepsilon^c$ around every token, i.e., $\err{\sigma}_\varepsilon\coloneqq \left(\varepsilon^c\err{\sigma}_i\right)_{i=1}^n\varepsilon^c$, where $|\err{\sigma}_\varepsilon| = cn + n + c$, and only considering substitution. We augment $\Sigma_\varepsilon \coloneqq \Sigma \cup \{\varepsilon\}$ so that deletions and insertions may be treated as special cases of substitution. Thus, we have $cn + n + c$ positions to substitute $(|\Sigma_\varepsilon|)$ tokens, i.e., ${{cn + n + c} \choose c}|\Sigma_\varepsilon|^c$ ways to edit $\err{\sigma}_\varepsilon$ for each $c \in [1, q]$. This upper bound is not tight, as overcounts many identical edits w.r.t. $\err{\sigma}$. Nonetheless, it is sufficient to show $|A| < \sum_{c=1}^q{{cn + n + c} \choose c}|\Sigma_\varepsilon|^c$.
\end{proof}

We note that the above bound applies to all strings and languages, and relates to the Hamming bound on $H_q(\err{\sigma}_\varepsilon)$, which only considers substitutions.~\footnote{This reflects our general approach, which builds a surjection from the interleaved Hamming ball onto the Levenshtein ball.} In practice, much tighter bounds may be obtained by considering the structure of $\ell$ and $\err{\sigma}$. For example, based on an empirical evaluation from a dataset of human errors and repairs in Python code snippets ($|\Sigma| = 50, |\err{\sigma}| < 40, \Delta(\err{\sigma}, \ell) \in [1, 3]$), we estimate the \textit{filtration rate}, i.e., the density of the admissible set relative to the Levenshtein ball, $D=|A|/|\Delta_q(\err{\sigma})|$ to have empirical mean $E_\sigma[D] \approx 2.6\times 10^{-4}$, and variance $\mathrm{Var}_\sigma[D] \approx 3.8\times10^{-7}$.

In practice, this problem is ill-posed even when $q = \Delta^*(\err{\sigma}, \ell) \approx 1$. For example, consider the language of ursine dietary preferences. Although $\err{\sigma}\coloneq$ ``Bears like to eat plastic'' is not a valid sentence, e.g., $\tilde{\sigma}\coloneq$``Bears like to eat'' is $(\Delta^*=1)$, however there are many others with roughly the same edit distance, e.g., ``Bears like to eat \{\hlorange{berries}, \hlorange{honey}, \hlorange{fish}\}'', or ``\{\hlgreen{Polar}, \hlgreen{Panda}\} bears like to eat \{\hlgreen{seals}, \hlgreen{bamboo}\}''. In general, there are usually many strings nearby $\err{\sigma}$, and we seek to find those among them which are both syntactically valid and semantically plausible as quickly as possible.
In practice, this problem is ill-posed even when $q = \Delta^*(\err{\sigma}, \ell) \approx 1$. For example, consider the language of ursine dietary preferences. Although $\err{\sigma}\coloneqq$ ``Bears like to eat plastic'' is not a valid sentence, e.g., $\tilde{\sigma}\coloneqq$``Bears like to eat'' is $(\Delta^*=1)$, however there are many others with roughly the same edit distance, e.g., ``Bears like to eat \{\hlorange{berries}, \hlorange{honey}, \hlorange{fish}\}'', or ``\{\hlgreen{Polar}, \hlgreen{Panda}\} bears like to eat \{\hlgreen{seals}, \hlgreen{bamboo}\}''. In general, there are usually many strings nearby $\err{\sigma}$, and we seek to find those among them which are both syntactically valid and semantically plausible as quickly as possible.

\section{Matrix Theory}\label{sec:matrix}

Expand Down Expand Up @@ -388,7 +388,7 @@ \subsection{Tensor sparsification}\label{sec:sparsity}
Regardless of $\sigma$, it is always possible to discharge nonterminals inhabiting $\Lambda_{i, j}$ bitvectors outside the $(n-j+i)$-step neighborhood of $S$, or the $(j-i)$-step neighborhood of any terminal due to unreachability. This constrains several of the least- and greatest- upper diagonals of $\mathcal{M}_\sigma$. Further improvements to sparsity can be achieved by considering the specific structure of $\sigma$ and $\mathcal{G}$.

\begin{definition}[Parikh image]
Let $\pi: \underline\Sigma^*\rightarrow\mathbb{N}^{|\Sigma|}$ be the Parikh vector~\cite{parikh1966context}, which counts the number of times each terminal appears in a string. We define the Parikh image as the set of all terminals indexed a nonzero element of the Parikh vector, i.e., $\Pi(\sigma:\underline\Sigma^*) \coloneq \{\sigma': \Sigma \mid 0 < \pi(\sigma)[\sigma']\}$. %Ordered by inclusion, the set of all minimal Parikh images, $\Pi^*(\sigma)\coloneqq \min_{\bm\sigma : 2^V \mid \forall \sigma' \in \bm\sigma \sigma' \in \text{H}(\sigma) \text{ and } }$ forms a partial order.
Let $\pi: \underline\Sigma^*\rightarrow\mathbb{N}^{|\Sigma|}$ be the Parikh vector~\cite{parikh1966context}, which counts the number of times each terminal appears in a string. We define the Parikh image as the set of all terminals indexed a nonzero element of the Parikh vector, i.e., $\Pi(\sigma:\underline\Sigma^*) \coloneqq \{\sigma': \Sigma \mid 0 < \pi(\sigma)[\sigma']\}$. %Ordered by inclusion, the set of all minimal Parikh images, $\Pi^*(\sigma)\coloneqq \min_{\bm\sigma : 2^V \mid \forall \sigma' \in \bm\sigma \sigma' \in \text{H}(\sigma) \text{ and } }$ forms a partial order.
\end{definition}

\begin{lemma}
Expand Down
139 changes: 139 additions & 0 deletions latex/tacas2023/nfa_cfg.tex
Original file line number Diff line number Diff line change
@@ -0,0 +1,139 @@
\begin{minipage}[c]{0.45\textwidth}
\centering
\underline{NIA}\vspace{10pt}
\resizebox{\textwidth}{!}{
\begin{tikzpicture}[
%->, % makes the edges directed
>=stealth',
node distance=2.5cm, % specifies the minimum distance between two nodes. Change if necessary.
% every state/.style={thick, fill=gray!10}, % sets the properties for each ’state’ node
initial text=$ $, % sets the text that appears on the start arrow
]
\node[state, initial] (00) {$q_{0,0}$};
\node[state, right of=00] (10) {$q_{1,0}$};
\node[state, right of=10] (20) {$q_{2,0}$};
\node[state, right of=20] (30) {$q_{3,0}$};
\node[right of=30] (40) {$\vphantom{\vdots}\cdots$};
\node[accepting, state, right of=40] (n0) {$q_{n,0}$};

\node[state, above of=00] (01) {$q_{0,1}$};
\node[state, right of=01] (11) {$q_{1,1}$};
\node[state, right of=11] (21) {$q_{2,1}$};
\node[state, right of=21] (31) {$q_{3,1}$};
\node[right of=31] (41) {$\vphantom{\vdots}\cdots$};
\node[accepting, state, right of=41] (n1) {$q_{n,1}$};

\node[above of=01] (0j) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$};
\node[right of=0j] (1j) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$};
\node[right of=1j] (2j) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$};
\node[right of=2j] (3j) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$};
\node[right of=3j] (4j) {$\iddots$};
\node[accepting, right of=4j] (nj) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$};

\node[state, above of=0j] (0k) {$q_{0,k}$};
\node[state, right of=0k] (1k) {$q_{1,k}$};
\node[state, right of=1k] (2k) {$q_{2,k}$};
\node[state, right of=2k] (3k) {$q_{3,k}$};
\node[right of=3k] (4k) {$\vphantom{\vdots}\cdots$};
\node[accepting, state, right of=4k] (nk) {$q_{n,k}$};

\draw [->] (00) edge[below] node{$\sigma_1$} (10);
\draw [->] (10) edge[below] node{$\sigma_2$} (20);
\draw [->] (20) edge[below] node{$\sigma_3$} (30);
\draw [->] (30) edge[below] node{$\sigma_4$} (40);
\draw [->] (40) edge[below] node{$\sigma_n$} (n0);

\draw [->] (01) edge[below] node{$\sigma_1$} (11);
\draw [->] (11) edge[below] node{$\sigma_2$} (21);
\draw [->] (21) edge[below] node{$\sigma_3$} (31);
\draw [->] (31) edge[below] node{$\sigma_4$} (41);
\draw [->] (41) edge[below] node{$\sigma_n$} (n1);

\draw [->] (0j) edge[below] node{$\sigma_1$} (1j);
\draw [->] (1j) edge[below] node{$\sigma_2$} (2j);
\draw [->] (2j) edge[below] node{$\sigma_3$} (3j);
\draw [->] (3j) edge[below] node{$\sigma_4$} (4j);
\draw [->] (4j) edge[below] node{$\sigma_n$} (nj);

\draw [->] (0k) edge[below] node{$\sigma_1$} (1k);
\draw [->] (1k) edge[below] node{$\sigma_2$} (2k);
\draw [->] (2k) edge[below] node{$\sigma_3$} (3k);
\draw [->] (3k) edge[below] node{$\sigma_4$} (4k);
\draw [->] (4k) edge[below] node{$\sigma_n$} (nk);

\draw [->] (00) edge[left] node{$*$} (11);
\draw [->] (10) edge[left] node{$*$} (21);
\draw [->] (20) edge[left] node{$*$} (31);
\draw [->] (30) edge[left] node{$*$} (41);
\draw [->] (30) edge[bend right, below] node{$\sigma_5$} (41);
\draw [->] (40) edge[ right] node{$\sigma_n$} (n1);
\draw [->] (40) edge[bend right, left] node{$*$} (n1);

\draw [->] (01) edge[left] node{$*$} (1j);
\draw [->] (11) edge[left] node{$*$} (2j);
\draw [->] (21) edge[left] node{$*$} (3j);
\draw [->] (31) edge[left] node{$*$} (4j);
\draw [->] (31) edge[bend right, below] node{$\sigma_5$} (4j);
\draw [->] (41) edge[ right] node{$\sigma_n$} (nj);
\draw [->] (41) edge[bend right, left] node{$*$} (nj);

\draw [->] (0j) edge[left] node{$*$} (1k);
\draw [->] (1j) edge[left] node{$*$} (2k);
\draw [->] (2j) edge[left] node{$*$} (3k);
\draw [->] (3j) edge[left] node{$*$} (4k);
\draw [->] (3j) edge[bend right, below] node{$\sigma_5$} (4k);
\draw [->] (4j) edge[ right] node{$\sigma_n$} (nk);
\draw [->] (4j) edge[bend right, left] node{$*$} (nk);

\draw [->] (00) edge[bend left, left] node{$*$} (01);
\draw [->] (10) edge[bend left, left] node{$*$} (11);
\draw [->] (20) edge[bend left, left] node{$*$} (21);
\draw [->] (30) edge[bend left, left] node{$*$} (31);
\draw [->] (40) edge[right] node{$*$} (41);
\draw [->] (n0) edge[bend right, right] node{$*$} (n1);

\draw [->] (01) edge[bend left, left] node{$*$} (0j);
\draw [->] (11) edge[bend left, left] node{$*$} (1j);
\draw [->] (21) edge[bend left, left] node{$*$} (2j);
\draw [->] (31) edge[bend left, left] node{$*$} (3j);
\draw [->] (41) edge[right] node{$*$} (4j);
\draw [->] (n1) edge[bend right, right] node{$*$} (nj);

\draw [->] (0j) edge[bend left, left] node{$*$} (0k);
\draw [->] (1j) edge[bend left, left] node{$*$} (1k);
\draw [->] (2j) edge[bend left, left] node{$*$} (2k);
\draw [->] (3j) edge[bend left, left] node{$*$} (3k);
\draw [->] (4j) edge[right] node{$*$} (4k);
\draw [->] (nj) edge[bend right, right] node{$*$} (nk);

\draw [->] (00) edge[below] node{$\sigma_2$} (21);
\draw [->] (10) edge[below] node{$\sigma_3$} (31);
\draw [->] (20) edge[below] node{$\sigma_4$} (41);

\draw [->] (01) edge[below] node{$\sigma_2$} (2j);
\draw [->] (11) edge[below] node{$\sigma_3$} (3j);
\draw [->] (21) edge[below] node{$\sigma_4$} (4j);

\draw [->] (0j) edge[below] node{$\sigma_2$} (2k);
\draw [->] (1j) edge[below] node{$\sigma_3$} (3k);
\draw [->] (2j) edge[below] node{$\sigma_4$} (4k);

%https://tex.stackexchange.com/a/20986/139648
\draw [decorate,decoration={brace,amplitude=10pt,raise=10pt,mirror}] (00.south west) -- (n0.south east) node[midway,yshift=-3em]{\textbf{String length}};
\draw [decorate,decoration={brace,amplitude=10pt,raise=20pt}] (00.south west) -- (0k.north west) node[midway,xshift=-40pt,rotate=90]{\textbf{Levenshtein edit distance}};
\end{tikzpicture}
}
\end{minipage}
\hfill
\begin{minipage}[l]{5 cm}
\centering
\underline{CFG}\vspace{7pt}
\begin{align*}
S &\Rightarrow \{\cdot \in Q \mid \delta(\cdot, q_{n,0}) \leq k\}\\
* &\Rightarrow \{\cdot \in \Sigma\}\\
\big\{q_{i, j} &\Rightarrow \{q_{i, j-1}*\} \mid i, j \in [1, n]\times[1, k]\big\}\\
\big\{q_{i, j} &\Rightarrow \{q_{i-1, j-1}*\}\mid i, j\in[1, n]\times [1, k]\big\}\\
\big\{q_{i, j} &\Rightarrow \{q_{i-1, j} \sigma_i \}\mid i, j \in [1, n]\times[0, k]\big\} \\
\big\{q_{i, j} &\Rightarrow \{q_{i-2, j-1} \sigma_i\} \mid i, j \in [2, n]\times[1, k] \big\}\\
\end{align*}
\end{minipage}
Binary file modified latex/tacas2023/tacas.pdf
Binary file not shown.
Loading

0 comments on commit 34804c6

Please sign in to comment.