CLIP Colloquium: Joe Barrow (CS) + Chenglei Si (UMD)

Time: 
Wednesday, March 16, 2022 - 11:00 AM to 12:00 PM
Location: 
5101 Brendan Iribe Center

Structural Scaffolds for Making Sense of Document Collections (by Joe Barrow)

Abstract: As readers, we often attempt to make sense of (one or more) documents using structure that goes beyond the content itself: a scientist using sections and subsections to "pre-read" a scientific paper or a web searcher trying to make sense of conflicting viewpoints about a topic. This structure helps a reader build mental maps of the information; without them, it is easy to "miss the forest for the trees." In this work, we aim to induce this structure in cases where it is not already explicit, which we refer to as "structural scaffolding." In particular, we focus on two types of scaffolds: topical scaffolds of documents, where we create labeled sections over an unstructured document to support pre-reading, and syntopical scaffolds of document collections, where we identify, group, and present viewpoints from many documents at once. We find that "content-only" approaches build worse scaffolds for both types than approaches that account for both content and context.
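To make the idea of a topical scaffold concrete, here is a minimal, purely illustrative sketch (not from the talk): it represents a scaffold as labeled sections over a flat list of sentences. The Section type, the labels, and the sentences are hypothetical and stand in for whatever structure an induction method would actually produce.

    # Hypothetical illustration of a topical scaffold: labeled sections
    # induced over an otherwise unstructured document.
    from dataclasses import dataclass

    @dataclass
    class Section:
        label: str   # induced topical label, e.g. "Data"
        start: int   # index of the first sentence in the section
        end: int     # index one past the last sentence

    sentences = [
        "We collect ten thousand documents.",
        "Each document is annotated by two raters.",
        "Accuracy improves over a content-only baseline.",
    ]

    # The scaffold pairs the document with an induced, labeled segmentation,
    # giving a reader section headings to "pre-read" before diving in.
    scaffold = [Section("Data", 0, 2), Section("Results", 2, 3)]

    for section in scaffold:
        print(section.label, "->", " ".join(sentences[section.start:section.end]))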

Bio: Joe Barrow is a PhD student at UMD, working with Prof. Philip Resnik and Prof. Doug Oard. He is interested in building tools that help people learn and make informed decisions.

---

Tokenization in the Era of Pretrained Language Models (by Chenglei Si)

Abstract: In this talk, we review the different tokenization strategies used in various pretrained language models (PLMs). We focus on the granularity of tokenization (sub-character, character, sub-word, word) and compare their pros and cons in terms of performance, efficiency, and robustness. In particular, we highlight two of our own works in this direction: one on fusing character and sub-word representations in English PLMs, and the other on a novel sub-character tokenization scheme designed for Chinese PLMs.
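For readers unfamiliar with the granularity distinction, here is a minimal, hypothetical sketch (not from the talk) contrasting word-, sub-word-, and character-level tokenization of one string. The greedy longest-match segmentation and the toy vocabulary stand in for learned schemes such as WordPiece or BPE used by actual PLMs.

    # Hypothetical illustration of tokenization granularities.
    # The toy vocabulary and greedy longest-match rule stand in for a
    # learned sub-word scheme (e.g. WordPiece/BPE) in a real PLM.

    def subword_tokenize(word, vocab):
        """Greedy longest-match-first segmentation of a single word."""
        pieces, start = [], 0
        while start < len(word):
            end = len(word)
            match = None
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate   # continuation marker
                if candidate in vocab:
                    match = candidate
                    break
                end -= 1
            if match is None:
                return ["[UNK]"]                   # out-of-vocabulary fallback
            pieces.append(match)
            start = end
        return pieces

    sentence = "tokenization matters"
    toy_vocab = {"token", "##ization", "matter", "##s"}

    word_tokens = sentence.split()                                  # word level
    subword_tokens = [p for w in word_tokens
                      for p in subword_tokenize(w, toy_vocab)]      # sub-word level
    char_tokens = list(sentence)                                    # character level

    print(word_tokens)     # ['tokenization', 'matters']
    print(subword_tokens)  # ['token', '##ization', 'matter', '##s']
    print(char_tokens)     # ['t', 'o', 'k', ...]

Character-level inputs keep the vocabulary tiny but produce long sequences; word-level inputs do the opposite; sub-word schemes sit in between, which is part of the performance, efficiency, and robustness trade-off the talk compares.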

Bio: Chenglei Si is an undergraduate at UMD advised by Prof. Jordan Boyd-Graber. His current research mainly focuses on question answering and generalization. He has published several first-author papers at ACL and EMNLP. This summer, he will join Microsoft as a research intern to work on GPT-3.