describe Bar-Hillel construction

breandan · Oct 3, 2023 · 34804c6 · 34804c6
1 parent 8b14737
commit 34804c6
Show file tree

Hide file tree

Showing 7 changed files with 280 additions and 102 deletions.
diff --git a/latex/popl2024/popl.pdf b/latex/popl2024/popl.pdf
diff --git a/latex/popl2024/popl.tex b/latex/popl2024/popl.tex
@@ -157,19 +157,19 @@ \section{Problem statement}\label{sec:problem}
   In either case, it follows $\Delta(\err{\sigma}, \ell) \leq \max(m, n)$ and $\ell$ is always reachable via a finite nonempty set of Levenshtein edits, i.e., $0 < \Delta(\err{\sigma}, \ell) < \infty$.
 \end{proof}
 
-Let us now consider the maximum growth rate of the \textit{admissible set}, $A \coloneq \Delta_q(\err{\sigma}) \cap \ell$, as a function of $q$ and $n$. Let $\bar\ell \coloneq \{\err{\sigma}\}$. Since $\bar\ell$ is finite and thus regular, $\ell = \Sigma^* \setminus \{\err{\sigma}\}$ is regular by the closure of regular languages under complementation, and thus context-free a fortiori. Since $\ell$ accepts every string except $\err{\sigma}$, it represents the worst CFL in terms of asymptotic growth of $A$.
+Let us now consider the maximum growth rate of the \textit{admissible set}, $A \coloneqq \Delta_q(\err{\sigma}) \cap \ell$, as a function of $q$ and $n$. Let $\bar\ell \coloneqq \{\err{\sigma}\}$. Since $\bar\ell$ is finite and thus regular, $\ell = \Sigma^* \setminus \{\err{\sigma}\}$ is regular by the closure of regular languages under complementation, and thus context-free a fortiori. Since $\ell$ accepts every string except $\err{\sigma}$, it represents the worst CFL in terms of asymptotic growth of $A$.
 
 \begin{lemma}\label{lemma:interleaving}
-  The complexity of searching $A$ is upper bounded by $\mathcal{O}\left(\sum_{c=1}^q{{cn + n + 1} \choose c}(|\Sigma| + 1)^c\right)$.
+  The complexity of searching $A$ is upper bounded by $\mathcal{O}\left(\sum_{c=1}^q{{cn + n + c} \choose c}(|\Sigma| + 1)^c\right)$.
 \end{lemma}
 
 \begin{proof}
-  We can overestimate the size of $A$ by considering the number of unique ways to insert, delete, or substitute $c$ terminals into a string $\err{\sigma}$ of length $n$. This can be overaproximated by interleaving $\varepsilon^c$ around every token, i.e., $\err{\sigma}_\varepsilon\coloneq \left(\varepsilon^c\err{\sigma}_i\right)_{i=1}^n\varepsilon^c$, where $|\err{\sigma}_\varepsilon| = cn + n + 1$, and only considering substitution. We augment $\Sigma_\varepsilon \coloneq \Sigma \cup \{\varepsilon\}$ so that deletions and insertions may be treated as special cases of substitution. Thus, we have $cn + n + 1$ positions to substitute $(|\Sigma_\varepsilon|)$ tokens, i.e., ${{cn + n + 1} \choose c}|\Sigma_\varepsilon|^c$ ways to edit $\err{\sigma}_\varepsilon$ for each $c \in [1, q]$. This upper bound is not tight, as overcounts many identical edits w.r.t. $\err{\sigma}$. Nonetheless, it is sufficient to show $|A| < \sum_{c=1}^q{{cn + n + 1} \choose c}|\Sigma_\varepsilon|^c$.
+  We can overestimate the size of $A$ by considering the number of unique ways to insert, delete, or substitute $c$ terminals into a string $\err{\sigma}$ of length $n$. This can be overaproximated by interleaving $\varepsilon^c$ around every token, i.e., $\err{\sigma}_\varepsilon\coloneqq \left(\varepsilon^c\err{\sigma}_i\right)_{i=1}^n\varepsilon^c$, where $|\err{\sigma}_\varepsilon| = cn + n + c$, and only considering substitution. We augment $\Sigma_\varepsilon \coloneqq \Sigma \cup \{\varepsilon\}$ so that deletions and insertions may be treated as special cases of substitution. Thus, we have $cn + n + c$ positions to substitute $(|\Sigma_\varepsilon|)$ tokens, i.e., ${{cn + n + c} \choose c}|\Sigma_\varepsilon|^c$ ways to edit $\err{\sigma}_\varepsilon$ for each $c \in [1, q]$. This upper bound is not tight, as overcounts many identical edits w.r.t. $\err{\sigma}$. Nonetheless, it is sufficient to show $|A| < \sum_{c=1}^q{{cn + n + c} \choose c}|\Sigma_\varepsilon|^c$.
 \end{proof}
 
 We note that the above bound applies to all strings and languages, and relates to the Hamming bound on $H_q(\err{\sigma}_\varepsilon)$, which only considers substitutions.~\footnote{This reflects our general approach, which builds a surjection from the interleaved Hamming ball onto the Levenshtein ball.} In practice, much tighter bounds may be obtained by considering the structure of $\ell$ and $\err{\sigma}$. For example, based on an empirical evaluation from a dataset of human errors and repairs in Python code snippets ($|\Sigma| = 50, |\err{\sigma}| < 40, \Delta(\err{\sigma}, \ell) \in [1, 3]$), we estimate the \textit{filtration rate}, i.e., the density of the admissible set relative to the Levenshtein ball, $D=|A|/|\Delta_q(\err{\sigma})|$ to have empirical mean $E_\sigma[D] \approx 2.6\times 10^{-4}$, and variance $\mathrm{Var}_\sigma[D] \approx 3.8\times10^{-7}$.
 
-In practice, this problem is ill-posed even when $q = \Delta^*(\err{\sigma}, \ell) \approx 1$. For example, consider the language of ursine dietary preferences. Although $\err{\sigma}\coloneq$ ``Bears like to eat plastic'' is not a valid sentence, e.g., $\tilde{\sigma}\coloneq$``Bears like to eat'' is $(\Delta^*=1)$, however there are many others with roughly the same edit distance, e.g., ``Bears like to eat \{\hlorange{berries}, \hlorange{honey}, \hlorange{fish}\}'', or ``\{\hlgreen{Polar}, \hlgreen{Panda}\} bears like to eat \{\hlgreen{seals}, \hlgreen{bamboo}\}''. In general, there are usually many strings nearby $\err{\sigma}$, and we seek to find those among them which are both syntactically valid and semantically plausible as quickly as possible.
+In practice, this problem is ill-posed even when $q = \Delta^*(\err{\sigma}, \ell) \approx 1$. For example, consider the language of ursine dietary preferences. Although $\err{\sigma}\coloneqq$ ``Bears like to eat plastic'' is not a valid sentence, e.g., $\tilde{\sigma}\coloneqq$``Bears like to eat'' is $(\Delta^*=1)$, however there are many others with roughly the same edit distance, e.g., ``Bears like to eat \{\hlorange{berries}, \hlorange{honey}, \hlorange{fish}\}'', or ``\{\hlgreen{Polar}, \hlgreen{Panda}\} bears like to eat \{\hlgreen{seals}, \hlgreen{bamboo}\}''. In general, there are usually many strings nearby $\err{\sigma}$, and we seek to find those among them which are both syntactically valid and semantically plausible as quickly as possible.
 
 \section{Matrix Theory}\label{sec:matrix}
 
@@ -388,7 +388,7 @@ \subsection{Tensor sparsification}\label{sec:sparsity}
 Regardless of $\sigma$, it is always possible to discharge nonterminals inhabiting $\Lambda_{i, j}$ bitvectors outside the $(n-j+i)$-step neighborhood of $S$, or the $(j-i)$-step neighborhood of any terminal due to unreachability. This constrains several of the least- and greatest- upper diagonals of $\mathcal{M}_\sigma$. Further improvements to sparsity can be achieved by considering the specific structure of $\sigma$ and $\mathcal{G}$.
 
 \begin{definition}[Parikh image]
-  Let $\pi: \underline\Sigma^*\rightarrow\mathbb{N}^{|\Sigma|}$ be the Parikh vector~\cite{parikh1966context}, which counts the number of times each terminal appears in a string. We define the Parikh image as the set of all terminals indexed a nonzero element of the Parikh vector, i.e., $\Pi(\sigma:\underline\Sigma^*) \coloneq \{\sigma': \Sigma \mid 0 < \pi(\sigma)[\sigma']\}$. %Ordered by inclusion, the set of all minimal Parikh images, $\Pi^*(\sigma)\coloneqq \min_{\bm\sigma : 2^V \mid \forall \sigma' \in \bm\sigma \sigma' \in \text{H}(\sigma) \text{ and } }$ forms a partial order.
+  Let $\pi: \underline\Sigma^*\rightarrow\mathbb{N}^{|\Sigma|}$ be the Parikh vector~\cite{parikh1966context}, which counts the number of times each terminal appears in a string. We define the Parikh image as the set of all terminals indexed a nonzero element of the Parikh vector, i.e., $\Pi(\sigma:\underline\Sigma^*) \coloneqq \{\sigma': \Sigma \mid 0 < \pi(\sigma)[\sigma']\}$. %Ordered by inclusion, the set of all minimal Parikh images, $\Pi^*(\sigma)\coloneqq \min_{\bm\sigma : 2^V \mid \forall \sigma' \in \bm\sigma \sigma' \in \text{H}(\sigma) \text{ and } }$ forms a partial order.
 \end{definition}
 
 \begin{lemma}

diff --git a/latex/tacas2023/nfa_cfg.tex b/latex/tacas2023/nfa_cfg.tex
@@ -0,0 +1,139 @@
+\begin{minipage}[c]{0.45\textwidth}
+  \centering
+  \underline{NIA}\vspace{10pt}
+  \resizebox{\textwidth}{!}{
+  \begin{tikzpicture}[
+%->, % makes the edges directed
+  >=stealth',
+  node distance=2.5cm, % specifies the minimum distance between two nodes. Change if necessary.
+%  every state/.style={thick, fill=gray!10}, % sets the properties for each ’state’ node
+  initial text=$ $, % sets the text that appears on the start arrow
+  ]
+  \node[state, initial]                (00) {$q_{0,0}$};
+  \node[state, right of=00]            (10) {$q_{1,0}$};
+  \node[state, right of=10]            (20) {$q_{2,0}$};
+  \node[state, right of=20]            (30) {$q_{3,0}$};
+  \node[right of=30]                   (40) {$\vphantom{\vdots}\cdots$};
+  \node[accepting, state, right of=40] (n0) {$q_{n,0}$};
+
+  \node[state, above of=00]            (01) {$q_{0,1}$};
+  \node[state, right of=01]            (11) {$q_{1,1}$};
+  \node[state, right of=11]            (21) {$q_{2,1}$};
+  \node[state, right of=21]            (31) {$q_{3,1}$};
+  \node[right of=31]                   (41) {$\vphantom{\vdots}\cdots$};
+  \node[accepting, state, right of=41] (n1) {$q_{n,1}$};
+
+  \node[above of=01]                   (0j) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$};
+\node[right of=0j]                   (1j) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$};
+\node[right of=1j]                   (2j) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$};
+\node[right of=2j]                   (3j) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$};
+\node[right of=3j]                   (4j) {$\iddots$};
+\node[accepting, right of=4j]        (nj) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$};
+
+\node[state, above of=0j]            (0k) {$q_{0,k}$};
+\node[state, right of=0k]            (1k) {$q_{1,k}$};
+\node[state, right of=1k]            (2k) {$q_{2,k}$};
+\node[state, right of=2k]            (3k) {$q_{3,k}$};
+\node[right of=3k]                   (4k) {$\vphantom{\vdots}\cdots$};
+\node[accepting, state, right of=4k] (nk) {$q_{n,k}$};
+
+\draw [->] (00) edge[below] node{$\sigma_1$} (10);
+\draw [->] (10) edge[below] node{$\sigma_2$} (20);
+\draw [->] (20) edge[below] node{$\sigma_3$} (30);
+\draw [->] (30) edge[below] node{$\sigma_4$} (40);
+\draw [->] (40) edge[below] node{$\sigma_n$} (n0);
+
+\draw [->] (01) edge[below] node{$\sigma_1$} (11);
+\draw [->] (11) edge[below] node{$\sigma_2$} (21);
+\draw [->] (21) edge[below] node{$\sigma_3$} (31);
+\draw [->] (31) edge[below] node{$\sigma_4$} (41);
+\draw [->] (41) edge[below] node{$\sigma_n$} (n1);
+
+\draw [->] (0j) edge[below] node{$\sigma_1$} (1j);
+\draw [->] (1j) edge[below] node{$\sigma_2$} (2j);
+\draw [->] (2j) edge[below] node{$\sigma_3$} (3j);
+\draw [->] (3j) edge[below] node{$\sigma_4$} (4j);
+\draw [->] (4j) edge[below] node{$\sigma_n$} (nj);
+
+\draw [->] (0k) edge[below] node{$\sigma_1$} (1k);
+\draw [->] (1k) edge[below] node{$\sigma_2$} (2k);
+\draw [->] (2k) edge[below] node{$\sigma_3$} (3k);
+\draw [->] (3k) edge[below] node{$\sigma_4$} (4k);
+\draw [->] (4k) edge[below] node{$\sigma_n$} (nk);
+
+\draw [->] (00) edge[left] node{$*$}         (11);
+\draw [->] (10) edge[left] node{$*$}         (21);
+\draw [->] (20) edge[left] node{$*$}         (31);
+\draw [->] (30) edge[left] node{$*$}         (41);
+\draw [->] (30) edge[bend right, below] node{$\sigma_5$} (41);
+\draw [->] (40) edge[           right] node{$\sigma_n$}  (n1);
+\draw [->] (40) edge[bend right, left] node{$*$}         (n1);
+
+\draw [->] (01) edge[left] node{$*$}                     (1j);
+\draw [->] (11) edge[left] node{$*$}                     (2j);
+\draw [->] (21) edge[left] node{$*$}                     (3j);
+\draw [->] (31) edge[left] node{$*$}                     (4j);
+\draw [->] (31) edge[bend right, below] node{$\sigma_5$} (4j);
+\draw [->] (41) edge[           right] node{$\sigma_n$}  (nj);
+\draw [->] (41) edge[bend right, left] node{$*$}         (nj);
+
+\draw [->] (0j) edge[left] node{$*$}                     (1k);
+\draw [->] (1j) edge[left] node{$*$}                     (2k);
+\draw [->] (2j) edge[left] node{$*$}                     (3k);
+\draw [->] (3j) edge[left] node{$*$}                     (4k);
+\draw [->] (3j) edge[bend right, below] node{$\sigma_5$} (4k);
+\draw [->] (4j) edge[           right] node{$\sigma_n$}  (nk);
+\draw [->] (4j) edge[bend right, left] node{$*$}         (nk);
+
+\draw [->] (00) edge[bend left, left] node{$*$}   (01);
+\draw [->] (10) edge[bend left, left] node{$*$}   (11);
+\draw [->] (20) edge[bend left, left] node{$*$}   (21);
+\draw [->] (30) edge[bend left, left] node{$*$}   (31);
+\draw [->] (40) edge[right] node{$*$}             (41);
+\draw [->] (n0) edge[bend right, right] node{$*$} (n1);
+
+\draw [->] (01) edge[bend left, left] node{$*$}   (0j);
+\draw [->] (11) edge[bend left, left] node{$*$}   (1j);
+\draw [->] (21) edge[bend left, left] node{$*$}   (2j);
+\draw [->] (31) edge[bend left, left] node{$*$}   (3j);
+\draw [->] (41) edge[right] node{$*$}             (4j);
+\draw [->] (n1) edge[bend right, right] node{$*$} (nj);
+
+\draw [->] (0j) edge[bend left, left] node{$*$}   (0k);
+\draw [->] (1j) edge[bend left, left] node{$*$}   (1k);
+\draw [->] (2j) edge[bend left, left] node{$*$}   (2k);
+\draw [->] (3j) edge[bend left, left] node{$*$}   (3k);
+\draw [->] (4j) edge[right] node{$*$}             (4k);
+\draw [->] (nj) edge[bend right, right] node{$*$} (nk);
+
+\draw [->] (00) edge[below] node{$\sigma_2$}    (21);
+\draw [->] (10) edge[below] node{$\sigma_3$}    (31);
+\draw [->] (20) edge[below] node{$\sigma_4$}    (41);
+
+\draw [->] (01) edge[below] node{$\sigma_2$}    (2j);
+\draw [->] (11) edge[below] node{$\sigma_3$}    (3j);
+\draw [->] (21) edge[below] node{$\sigma_4$}    (4j);
+
+\draw [->] (0j) edge[below] node{$\sigma_2$}    (2k);
+\draw [->] (1j) edge[below] node{$\sigma_3$}    (3k);
+\draw [->] (2j) edge[below] node{$\sigma_4$}    (4k);
+
+%https://tex.stackexchange.com/a/20986/139648
+\draw [decorate,decoration={brace,amplitude=10pt,raise=10pt,mirror}] (00.south west) -- (n0.south east) node[midway,yshift=-3em]{\textbf{String length}};
+\draw [decorate,decoration={brace,amplitude=10pt,raise=20pt}] (00.south west) -- (0k.north west) node[midway,xshift=-40pt,rotate=90]{\textbf{Levenshtein edit distance}};
+\end{tikzpicture}
+}
+\end{minipage}
+\hfill
+\begin{minipage}[l]{5 cm}
+\centering
+\underline{CFG}\vspace{7pt}
+\begin{align*}
+S &\Rightarrow \{\cdot \in Q \mid \delta(\cdot, q_{n,0}) \leq k\}\\
+* &\Rightarrow \{\cdot \in \Sigma\}\\
+\big\{q_{i, j} &\Rightarrow \{q_{i, j-1}*\} \mid i, j \in [1, n]\times[1, k]\big\}\\
+\big\{q_{i, j} &\Rightarrow \{q_{i-1, j-1}*\}\mid i, j\in[1, n]\times [1, k]\big\}\\
+\big\{q_{i, j} &\Rightarrow \{q_{i-1, j} \sigma_i \}\mid i, j \in [1, n]\times[0, k]\big\} \\
+\big\{q_{i, j} &\Rightarrow \{q_{i-2, j-1} \sigma_i\} \mid i, j \in [2, n]\times[1, k] \big\}\\
+\end{align*}
+\end{minipage}
diff --git a/latex/tacas2023/tacas.pdf b/latex/tacas2023/tacas.pdf