Summary of Semester

Here’s a summary of the topics for the semester:

Week 1: Introduction

  • Attention, Transformers, and BERT
  • Training LLMs, Risks and Rewards

Week 2: Alignment

  • Introduction to AI Alignment and Failure Cases
  • Redteaming
  • Jail-breaking LLMs

Week 3: Prompting and Bias

  • Prompt Engineering
  • Marked Personas

Week 4: Capabilities of LLMs

  • LLM Capabilities
  • Medical Applications of LLMs

Week 5: Hallucination

  • Hallucination Risks
  • Potential Solutions

Week 6: Visit from Anton Korinek

Week 7: Generative Adversarial Networks and DeepFakes

  • GANs and DeepFakes
  • Creation and Detection of DeepFake Videos

Week 8: Machine Translation

  • History of Machine Translation
  • Neural Machine Translation

Week 9: Interpretability

  • Introduction to Interpretability
  • Mechanistic Interpretability

Week 10: Data for Training

  • Data Selection for Fine-tuning LLMs
  • Detecting Pretraining Data from Large Language Models
  • Impact of Data on Large Language Models
  • The Curse of Recursion: Training on Generated Data Makes Models Forget

Week 11: Watermarking

  • Watermarking LLM Outputs
  • Watermarking Diffusion Models

Week 12: LLM Agents

  • LLM Agents
  • Tools and Planning

Week 13: Regulating Dangerous Technologies

  • Analogies from other technologies for regulating AI

Week 14a: Multimodal Models
Week 14b: Ethical AI

Week 14b: Ethical AI

Presenting Team: Aparna Kishore, Elena Long, Erzhen Hu, Jingping Wan
Blogging Team: Haolin Liu, Haochen Liu, Ji Hyun Kim, Stephanie Schoch, Xueren Ge

Note: since the topics were unrelated, Week 14 is split into two posts:

Wednesday, November 29: Ethical AI

Ben Shneiderman. Bridging the Gap Between Ethics and Practice: Guidelines for Reliable, Safe, and Trustworthy Human-centered AI Systems. ACM Transactions on Interactive Intelligent Systems, October 2020. PDF

Today’s topic is ethical AI, with a focus on human-centered AI (HCAI). From this perspective, AI is seen as amplifying the performance of humans.

Important to HCAI is the need for reliable, safe and trustworthy properties, through the collaboration of software engineers, companies, government, and society as a whole.

  1. Reliable Systems: Software Engineering
  2. Safety Culture: Organizational Design
  3. Trustworthy Certification: External Reviews

Things that should be considered when developing ethical AI:

  1. Data quality
  2. Training log analysis
  3. Privacy and security of data

Example: the FDR (flight data recorder) provides a quantitative benchmark for judging whether a plane is safe and stable, which can help in designing the next generation of products.

Analogy of FDR to AI: We could get quantitative feedback of the product or strategy we want to test: What data do we need, how do we analyze log data (or select useful data from operation logs), how to protect data from being attacked, etc.

Through a similar approach, we can demonstrate that AI is safe through testing and logs, rather than just asking people to ‘take our word for it’.

Software engineering workflows: AI workflows require goal-aligned updates.

Verification and validation testing:

  1. Design tests that align with user expectations and prevent harms.
  2. Goals of AI are more general or high-level than traditional software programs, so we need tests that are designed with user expectations rather than solely the technical details.

Bias testing to enhance fairness:

  1. Test training data for opacity, scale, harm.
  2. Use specialized tools for continuous monitoring.
  3. After we have a trained model, we still need testing to check the risk, and may need a specific team within the organization, or an external company, to test the safety of the model (this should be continuous).

Explainable user interfaces:

  1. Are difficult to achieve
  2. Ensure system explainability for user understanding, meeting legal requirements
  3. Intrinsic and post hoc explanations aid developer improvement.
  4. Design a comprehensive user interface, considering user sentiments
  5. Post hoc explanations require no information about the technical details of the model; instead they give a broad-level idea of the system

There are five principles to build safety cultures, which are mostly top-down approaches (see slides).

Leadership: create a safe team, make commitment to safety that is visible to employees so they know leaders are committed to safety.

Long-term investment: need safe developers to develop safe models.

The public can help monitor and improve systems, as external pressure pushes companies to work harder to eliminate issues.

Internal Review Boards engage stakeholders in setting benchmarks and to make improvements for problems and future planning.

Trustworthy certification by independent oversight:

  • Purpose: Ensure continuous improvement for reliable, safe products. Helps to make a complete, trustworthy system.

  • Requirements: Respected leaders, conflict declaration, diverse membership.

  • Capacity: Examine private data, conduct interviews, issue subpoenas for evidence.

Independent oversight is structured around three core methods:

  1. Planning
  2. Monitoring
  3. Conducting reviews or retrospectives

There are five paths to trustworthy certification:

  1. Government: Policy and Regulation, aligning with the EU’s seven key principles (listed on the top right of the slide) for transparency, reliability, safety, privacy, and fairness

  2. Accounting Firms: Beyond the internal audits mentioned previously, external bodies should audit the entire industry

  3. Insurance Companies: Adapting policies for emerging technologies like self-driving cars (details on next slide)

  4. Non-government organizations: prioritizing the public’s interest

  5. Professional organizations and research institutes

As an activity, we tried role-playing, where each group played a different role and considered the 15 principles in terms of “ethical AI”.

Ethical Team:

  1. Diagnosis of skin cancer: ensure dataset quality is reliable (bias in skin color; state laws passed for collecting data)
  2. Various Metrics for evaluating AI
  3. Come to an agreement with patients and doctors

Healthcare Management/Organization:

  1. Reporting failures (missed diagnosis) for feedback
  2. Data security, gathering false positive and false negative cases for further training
  3. Educating staff
  4. Establishing an accuracy/certainty threshold for AI diagnosis of skin cancer, and checking the standard of professional verification

Independent oversight committee:

  1. Ensure the dataset is unbiased at every stage and represents all races, genders, etc.
  2. Data source should be considered carefully (online, hospital)
  3. Model explanation and transparency should be considered
  4. Privacy of personal information of both the dataset and the users

There are 15 principles each group can take into consideration for the role-playing discussion.

Reorienting technical R&D emphasizes oversight, robustness, interpretability, inclusivity, risk assessment, and addressing emerging challenges.

Proposed governance measures include enforcing standards to prevent misuse, requiring registration of frontier systems, implementing whistleblower protections, and creating national and international safety standards. Additionally, the accountability of frontier AI developers and owners, along with AI companies promptly disclosing if-then commitments, is highlighted.

There are some ethical platforms for developing responsible AI products:

  1. SUM Values: provide a framework for the moral scope of an AI product
  2. FAST Track Principles: ensure an AI project is fair, bias-mitigating, and reliable
  3. PBG Framework: set up a transparent process for the AI product

Putting the Ethical Platform into Practice needs three key steps: reflect, act and justify:

  1. Reflect using the SUM values: ask and answer questions about ethical purposes and assess the impacts of the AI project
  2. Act using FAST TRACK Principles: ensure every step of development produces safe, fair AI innovation
  3. Justify Using the PBG Framework: set up governance process to ensure model transparency

Team 1

There are many trajectories that AI development could take, so it would be very difficult to completely discount something as a possibility. The team related this to the book Dark Matter by Blake Crouch.

Risk would primarily come from bad actors (specifically humans). Briefly touched on ‘what if the bad actor is the AI?’

Team 2

The potential downfall of humans would not be due to AI’s maliciousness.

In the post-autonomous era, concerns shift to the misuse of models for harmful purposes.

Team 3

The second question seems to be already happening.

Given the rapid technological progress in recent years, a single prompt can result in losing control over AI models, and speculation around ‘Q* (Q-Star)’ suggests a real risk of losing control over AI models. However, AI’s power-seeking behavior may still be overstated.



(Post response by Tuesday, 28 November)

Paper 1: Bridging the Gap Between Ethics and Practice

  1. The paper claims, “Human-centered Artificial Intelligence (HCAI) systems represent a second Copernican revolution that puts human performance and human experience at the center of design thinking.” Do you agree with this quote?
  2. Developers/teams, organizations, users and regulators often have different views on what constitutes reliability, safety, and trustworthiness in human-centered AI systems. What are the potential challenges and solutions for aligning them? Can you provide some specific examples where these views do not align?

Paper 2: Managing AI Risks in an Era of Rapid Progress

  1. Do you think AI systems can be regulated by an international governance organization or agreement, like nuclear weapons?
  2. Consider this quote from the paper: “Without sufficient caution, we may irreversibly lose control of autonomous AI systems, rendering human intervention ineffective. Large-scale cybercrime, social manipulation, and other highlighted harms could then escalate rapidly. This unchecked AI advancement could culminate in a large-scale loss of life and the biosphere, and the marginalization or even extinction of humanity.” Do you agree with it? If so, do you think any of the measures proposed in the paper would be sufficient for managing such a risk? If not, what assumptions of the authors’ that led to this conclusion do you think are invalid or unlikely?

Week 14a: Multimodal Models

Presenting Team: Aparna Kishore, Elena Long, Erzhen Hu, Jingping Wan
Blogging Team: Haolin Liu, Haochen Liu, Ji Hyun Kim, Stephanie Schoch, Xueren Ge

Note: since the topics were unrelated, Week 14 is split into two posts:

Monday, November 27: Multimodal Models

Today’s topic is how to improve model performance by combining multiple modes.

We will first introduce the multimodal foundations and then center around CLIP, which is the most famous vision-language model.

We live in a multimodal world, and our brains naturally learn to process multiple sensory signals received from the environment to help us make sense of the world around us. More specifically, vision is a large portion of how humans perceive, while language is a large portion of how humans communicate.

When we talk about vision-language, there are two types of interactions to consider: one is how we can produce visual data, and the other is how we can consume visual information.

For visual generation, popular models include GANs and diffusion models. What makes this multi-modal is that we can use other modalities to control the image we want to generate; for example, text-to-image methods use text-conditioned visual generation, as in Stable Diffusion.

Another approach focuses on visual understanding, which studies how we can consume visual information from an image and, further, how we can consume audio, images, and other modalities from our surrounding environment.

Greg Brockman, one of the founders of OpenAI, showed ChatGPT a pencil-sketched mock-up of a joke website. ChatGPT then produced a functional website. This is quite remarkable, as you can start to plug images into language models.


When we see the text “Sunshine, Sunny beach, Coconut, Straw hat”, we can visualize a picture of a beach with these components. This is because our mind not only receives multimodal information but also somehow aligns these modalities.

Now we move to the detailed algorithm of vision-language models. There are particular vision-language problem spaces or representative tasks that these models try to solve.

The first question is how to train a vision-language model. We will discuss supervised pre-training and contrastive language-image pre-training, which is also known as CLIP.

Supervised learning maps an image to a discrete label that is associated with its visual content. The drawback is that we always need labeled data, yet human annotations can be expensive and labels are limited.

The supervised learning method was deployed first. In 2021, OpenAI released Dall-E, a generative model that uses a transformer architecture like GPT-3. The model receives both text and images during training, and it can generate images from scratch based on natural language input.

As seen in the images above, it can combine disparate ideas to synthesize objects, even when some of them are unlikely to exist in the real world.


Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021. PDF

Different from Dall-E, CLIP takes an image and text and connects them in a non-generative way. The idea is that given an image, the model can predict the text that goes along with it.

Traditional image classification models are trained to identify objects from a predefined set of categories (for example, there are about 1000 categories in the ImageNet challenge). CLIP is trained to understand the semantics of images and text together. It is trained with a huge amount of data (400 million images from the web with corresponding text), and it can perform object identification in any category without re-training.

Since CLIP was trained using a combination of image and text, the training data is a batch of (image, text) pairs.

On top, we have the text labels that belong to each image; the model tokenizes them, passes them to the text encoder, performs a linear projection, and passes the result into the contrastive embedding space. It does the same for images.

Then, in the contrastive embedding space, the model takes the inner product of each image vector and text vector. In contrastive learning, we want to push the values of the blue squares (the original image-text pairs) toward 1, and decrease the values of the white squares (pairs that do not belong together). To achieve this, they compute the loss over these image-to-text and text-to-image similarities and backpropagate.

We now elaborate more on the loss function of this training process. We have two vectors (text, image) here, $v$ represents the text vector, and $u$ represents the image vector, and $\tau$ here is a trainable parameter.

In the first, text-to-image, loss function, they take the cosine similarities of these two vectors, sum up all the rows in the denominator, and normalize via softmax. As we can see, it is an asymmetric problem, so to compute the image-to-text loss function they sum up all the columns instead of the rows.
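In the notation of the text ($v_i$ for text vectors, $u_i$ for image vectors, $\tau$ the trainable temperature, $N$ the batch size), the two directional losses can be written as follows (a sketch consistent with the description, with cosine similarity written as the inner product of normalized vectors):

$$
\ell_i^{(v \to u)} = -\log \frac{\exp(\langle v_i, u_i \rangle / \tau)}{\sum_{k=1}^{N} \exp(\langle v_i, u_k \rangle / \tau)},
\qquad
\ell_i^{(u \to v)} = -\log \frac{\exp(\langle u_i, v_i \rangle / \tau)}{\sum_{k=1}^{N} \exp(\langle u_i, v_k \rangle / \tau)}
$$

The first normalizes over all images for a fixed text (the “rows”); the second over all texts for a fixed image (the “columns”).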

After that, they compute a cross-entropy loss of these two probability distributions, sum up all the batches, and then average it out to get the final loss function.
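The whole procedure (cosine similarities, softmax over rows and columns, cross-entropy averaged over the batch) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation, and the temperature $\tau$ is fixed here rather than trained:

```python
import numpy as np

def clip_loss(u, v, tau=0.07):
    """Symmetric contrastive loss over a batch of N (image, text) pairs.
    u: image embeddings (N, d); v: text embeddings (N, d);
    matching pairs share the same row index."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    logits = (u @ v.T) / tau          # cosine similarities scaled by temperature
    labels = np.arange(len(u))        # the diagonal holds the true pairs

    def xent(l):                      # cross-entropy with the diagonal as targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # image-to-text normalizes over columns; text-to-image over rows
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned embeddings (e.g., identity matrices for both modalities) the loss is near zero; for random embeddings it is strictly positive.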

After pre-training, the second and third steps are for object identification. We have a new dataset with different classes and want to test CLIP on it. In Step 2, we pass these classes to the pre-trained text encoder. Instead of passing class names alone, they use a prompt template, making a sentence out of each class name. The model then performs the same linear projection as in pre-training and passes the result into the contrastive space.

Then in Step 3, we take the image we want to predict, pass it into the image encoder, do linear projection, go into the contrastive embedding space and take the inner products of this image vector and all text vectors in Step 2. The final prediction output will be the one that has the highest cosine similarity.
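Steps 2 and 3 amount to a nearest-neighbor search in the shared embedding space. Here is a toy NumPy sketch; the embedding vectors below are made up, standing in for the encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Return the index of the class whose text embedding has the highest
    cosine similarity with the image embedding (CLIP's Step 3)."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = normalize(text_embs) @ normalize(image_emb)
    return int(np.argmax(sims))

# Toy embeddings standing in for encoder outputs (hypothetical values).
prompts = ["a photo of a dog", "a photo of a cat", "a photo of a crane"]
text_embs = np.array([[0.9, 0.1, 0.0],
                      [0.1, 0.9, 0.0],
                      [0.0, 0.1, 0.9]])
image_emb = np.array([0.2, 0.8, 0.1])  # pretend this encodes a cat photo
print(prompts[zero_shot_classify(image_emb, text_embs)])  # → a photo of a cat
```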

The authors share three main ideas behind this work:

  1. The need for a sufficiently large dataset. The simple truth is that existing manually labeled datasets are just way too small (100k samples) for training a natural language supervised model on the scale of GPT. The intuition is that the required dataset already exists on the web without the need to label data manually. So they created a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet.

  2. Efficient pre-training method. After experimenting with class-label prediction, the authors realized that the key to success was in predicting only which text as a whole is paired with which image, not the exact word of that text. This discovery led to the use of the loss function we introduced earlier, such that the cosine similarity for each correct pair of embeddings is maximized, and the cosine similarity of the rest of the pairings are minimized.

  3. Using transformers. After some experiments, they selected a transformer as the text encoder and left two options for the image encoder. The image encoder is either a Vision Transformer or a modified ResNet-D with attention pooling instead of global average pooling.

The figure above shows that CLIP is far more data-efficient than the other methods.

With prompt engineering and ensembling, models achieve higher accuracy than with contextless class names alone.

One observation is that CLIP performs poorly on differentiating word sense when there’s only a label without context. For example, the label “crane” can mean construction crane or a crane that flies.

Another observation is that, in their pre-training dataset, it is relatively rare to see an image paired with just a single word as its text. So, to bridge the distribution gap, they use a prompt template: instead of a single label, they use template sentences like “a photo of a {label}”. They also found that customizing the prompt text to each task can further improve performance.
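Prompt ensembling can be sketched as embedding several templated sentences per class and averaging the resulting text embeddings. The template sentences and the stand-in encoder below are illustrative, not CLIP's actual templates:

```python
import numpy as np

# Illustrative templates; CLIP uses a larger, task-tuned set.
templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

def class_embedding(label, encode):
    """Embed each templated sentence for a class, then average and
    re-normalize to get a single text embedding for that class."""
    embs = np.stack([encode(t.format(label)) for t in templates])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Stand-in encoder (hypothetical); in practice this is CLIP's text encoder.
fake_encode = lambda s: np.array([float(len(s)), float(s.count("o")), 1.0])
emb = class_embedding("crane", fake_encode)
print(round(float(np.linalg.norm(emb)), 3))  # → 1.0
```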

As we can see, the error rate decreases smoothly as a function of model compute. However, they note that there is a lot of variance: this curve is the average, and results vary widely across individual datasets. This may be due to how the dataset is selected, how the prompt is engineered, or other unknown factors.

For evaluation in terms of performance, CLIP is compared with a linear probe on ResNet50. It is pretty impressive that zero-shot CLIP outperforms a fully trained model on many of the datasets, including ImageNet.

On the other side, CLIP is weak on several specialized, complex, or abstract tasks such as EuroSAT (satellite image classification) and KITTI Distance (recognizing distance to the nearest car). This may be because these are not the kinds of text and images found frequently on the Internet, or because these tasks are different enough from common image tasks yet simple enough for a custom-trained model to do well.

Here we compare zero-shot CLIP with few-shot linear probes. This is where pre-training really pays off, as the model sees only a few examples per class.

Surprisingly, zero-shot CLIP is comparable to a 16-shot BiT-M model, which is one of the best open models for transfer learning in computer vision. If we linear-probe the CLIP model, it far outperforms these other linear-probe models.

They also evaluate CLIP in terms of its robustness to perturbations. Here they compare zero-shot CLIP to models that have been trained on ImageNet, finding that zero-shot CLIP matches the performance of ResNet-101. While the ImageNet-trained classifier degrades on harder and harder datasets, CLIP remains more robust. This suggests that the representations in CLIP are nuanced enough to pick up on features beyond those needed to distinguish, say, a banana from the other ImageNet classes.

Here, they customize zero-shot CLIP to each dataset (adapt to class shift in purple) based on class names. While this supervised adaptation to class shift increases ImageNet accuracy by around 10 percent, it slightly reduces the average robustness. From the right side, the improvements are concentrated on only a few datasets.

On the other hand, when they adapt CLIP to fully supervised logistic regression classifiers on the best CLIP model’s features, it comes close to the standard ImageNet training in terms of robustness. Thus, it seems that the representation itself in zero-shot CLIP has more value with more stability and nuance.

There are various works following CLIP based on this contrastive learning structure. The first extension is to further scale up texts and images. The second is to design better models.

Reproducible Scaling Laws

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. CVPR 2023. PDF

This paper used the open large-scale LAION-2B dataset to pre-train OpenCLIP across different scales.


Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, Ludwig Schmidt. DataComp: In search of the next generation of multimodal datasets. arxiv 2023. PDF

This paper asks how we should scale data: should we scale up with noisier and noisier data?

Their focus is to search for the next generation of image-text datasets. Instead of fixing the dataset and designing different algorithms, the authors propose to fix the CLIP training method but vary the datasets instead. With this method, they come up with a high-quality large-scale dataset.


Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, Chunjing Xu. FILIP: Fine-grained Interactive Language-Image Pre-Training. ICLR 2022. PDF

FILIP improves the vision-language interaction in CLIP: instead of matching only global image and text embeddings, it computes fine-grained, token-wise similarities between image patches and text tokens through a late-interaction mechanism, improving performance on downstream tasks.


Sheng Shen, Chunyuan Li, Xiaowei Hu, Jianwei Yang, Yujia Xie, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach, Jianfeng Gao. K-LITE: Learning Transferable Visual Models with External Knowledge. NeurIPS 2022. PDF

Another line of work focuses on improving the language-side model design of CLIP. K-LITE utilizes Wikipedia definitions of entities together with the original alt-text for contrastive pre-training. Such knowledge is useful for a variety of domains and datasets, making it possible to build a generic approach for task-level transfer.

Recall that in the motivating example, we argue that more modalities will enhance the learning process.


Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. ImageBind: One Embedding Space To Bind Them All. arxiv 2023. PDF

ImageBind tries to use more modalities to improve performance. However, one challenge here is that not all pairs of modalities are naturally aligned, due to the lack of corresponding paired data in the training set.

For ImageBind, the modalities include image, text, video, audio, depth, thermal, and IMU (which contains accelerometer and gyroscope data). The goal of ImageBind is to learn a single joint embedding space for all the modalities, using images as the binding modality. Here $I$ denotes the image modality, and $M$ denotes any other modality. They use deep neural networks as encoders to extract embeddings from each modality, so each modality has its own encoder, just like CLIP.

During training, the image and text modalities were kept frozen, while the weights of the other modality encoders were updated. This freezing allows alignment to emerge between modalities for which we have no natural pairing, for example, between audio and depth.

The preprocessed inputs are passed through their encoders and then through a simple linear layer to ensure they have the same dimension, before being trained with the InfoNCE loss. This loss is a modified cross-entropy loss that extends contrastive learning to multiple modalities. Let the output for the image be $q$ and the output for the other modality be $k$; the loss aligns the image modality with each of the other modalities.
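The loss described above can be written as follows (a sketch in the notation of the text, with $i$ indexing the batch and $\tau$ a temperature):

$$
\mathcal{L}_{I,M} = -\log \frac{\exp(q_i^{\top} k_i / \tau)}{\exp(q_i^{\top} k_i / \tau) + \sum_{j \neq i} \exp(q_i^{\top} k_j / \tau)}
$$

Here $(q_i, k_i)$ are embeddings of the two modalities for the same underlying sample, and the sum in the denominator ranges over the unrelated samples in the batch; in practice a symmetric version, $\mathcal{L}_{I,M} + \mathcal{L}_{M,I}$, is used.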

They study whether ImageBind’s embeddings can be used to compose information across modalities. The figure above shows image retrievals obtained by adding together image and audio embeddings. The joint embedding space allows us to compose two embeddings: e.g., an image of fruits on a table plus the sound of chirping birds retrieves an image that contains both concepts, i.e., fruits on trees with birds. Such emergent compositionality, whereby semantic content from different modalities can be composed, will likely enable a rich variety of compositional tasks.
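Composition by embedding arithmetic can be illustrated with a toy NumPy sketch; the vectors and gallery below are made-up stand-ins for ImageBind encoder outputs:

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(query_emb, gallery):
    """Nearest neighbor in the joint embedding space by cosine similarity."""
    names, embs = zip(*gallery.items())
    sims = normalize(np.stack(embs)) @ normalize(query_emb)
    return names[int(np.argmax(sims))]

# Hypothetical 2-D embeddings for a tiny image gallery.
gallery = {
    "fruits on a table":          np.array([1.0, 0.0]),
    "birds in the sky":           np.array([0.0, 1.0]),
    "fruits on trees with birds": np.array([0.7, 0.7]),
}
image_emb = np.array([1.0, 0.1])   # image of fruits
audio_emb = np.array([0.1, 1.0])   # sound of chirping birds
combined = normalize(image_emb) + normalize(audio_emb)
print(retrieve(combined, gallery))  # → fruits on trees with birds
```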

By utilizing the audio embedding of ImageBind, it is possible to design an audio-based detector that can detect and segment objects based on audio prompts.

As proposed in CLIP, replacing labels with textual descriptions and using a text encoder to encode them can feasibly convert closed-set problems to open-set ones. A number of works have been proposed to transform different computer vision tasks by replacing the label space with language space.

For the first question, we believe there are several differences between human and machine cognition. Although these models will outperform humans on several specific tasks, they also have limitations. For example, humans perceive an image as a whole, while machines perceive it pixel by pixel. This enables humans to use context to interpret images and text. While these models can recognize patterns and correlations between words and images, they may not fully grasp the broader context as humans do.

For the second question, the presenter gave an example of a specific food from Wuhan called “hot dry noodles”. When we give a picture of this kind of noodles with the caption “hot dry noodles in Wuhan”, the multi-modal models will describe how this food is popular in Wuhan. However, if we replace the caption with “hot dry noodles in Shandong”, the model will still describe the noodles as from Wuhan rather than Shandong. The presenter believes this is an example of bias, because most of the data about these noodles is associated with Wuhan. Thus, even though the caption of the image changed, the model cannot comprehend the difference because its representation is fixed.

Readings and Discussion Questions

Monday 27 November: Transferring and Binding Multi-Modal Capabilities:

Readings for Monday:


(Post response by Sunday, 26 November)

  1. What are some potential real-world applications of CLIP and ImageBind? Could these technologies transform industries like healthcare, education, or entertainment?
  2. How do CLIP and ImageBind mimic or differ from human cognitive processes in interpreting and linking visual and textual information?
  3. What are potential challenges in creating datasets for training models like CLIP and ImageBind? How can the quality of these datasets be ensured?
  4. What are the potential ethical implications of technologies like CLIP and ImageBind, especially in terms of privacy, bias, and misuse? How can these issues be mitigated?

Week 13: Regulating Dangerous Technologies

The slides are here: Regulating Dangerous Technologies (I’ve included some slides in the posted slides that I didn’t present in class but you might find interesting, including some excerpts from a talk I gave in 2018 on Mutually Assured Destruction and the Impending AI Apocalypse.)

Since one of the groups made the analogy to tobacco products, I also will take the liberty of pointing to a talk I gave at Google making a similar analogy: The Dragon in the Room.

Stephanie made the point after class about how important individuals making brave decisions is to things working out, in particular with humanity (so far!) avoiding annihilating ourselves with nuclear weapons. Stanislav Petrov may well have been the single person between us and nuclear destruction in 1983, when he prevented an alert (which he correctly determined was a false alarm) produced by the Soviet detection system from going up the chain.

Here’s one (of many) articles on this: ‘I Had A Funny Feeling in My Gut’, Washington Post, 10 Feb 1999. There is still a lot of uncertainty and skepticism if we should be fearing any kind of out-of-control AI risk, but it is not so hard to imagine scenarios where our fate will similarly come down to an individual’s decision at a critical juncture. (On the other hand, this article argues that we shouldn’t oversensationalize Petrov’s actions and there were many other safeguards between him and nuclear war, and we really shouldn’t design extinction-level systems in a way that they are so fragile to depend on an individual decision: Did Stanislav Petrov save the world in 1983? It’s complicated, from a Russian perspective.)

Week 12: LLM Agents

Presenting Team: Liu Zhe, Peng Wang, Sikun Guo, Yinhan He, Zhepei Wei

Blogging Team: Anshuman Suri, Jacob Christopher, Kasra Lekan, Kaylee Liu, My Dinh

Monday, November 13: LLM Agents

LLM agents are the “next big thing”, with the potential to directly impact important fields like healthcare and education. Essentially, they are LLM-based systems that have the ability to use external tools, such as Internet browsing access and calculators, to augment their abilities.


Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv 2023. PDF

LLMs have limitations that can potentially be addressed with these “tools”:

  • Outdated information: LLMs cannot access up-to-date information without access to external sources. Giving them the ability to access real-time information (via Internet queries) would lead to better responses to queries such as “Who is the President of the USA today?”
  • Hallucination: External knowledge sources can help ground generation in facts and work to supplement the model’s knowledge, reducing the possibility of hallucinating.
  • Lack of mathematical skills: Access to a calculator can help the model generate correct responses and computations involving math. Using zero-shot learning can help reduce hallucination, but providing access to a calculator (assuming it is used correctly) can guarantee correct responses.

Other limitations include limited multi-language usability, having no concept of “time”, etc.

Key Contributions

The main idea is to develop a system that has the ability to use external tools (translation, calendar, search engine, etc.). The key lies in knowing when to use a tool, which tool to use, and how to use it. Training is self-supervised, unlike other capability-enhancing techniques like RLHF.

Data Collection

Key step: generating candidate API calls via in-context learning. The method starts with examples generated by humans, e.g. in-context examples for “Coca-Cola”, etc.

$k$ positions are sampled at random from the text to serve as “candidates” for adding <API> tags.

Tokens up to the position with an <API> tag are provided to the model to get $m$ candidate API calls.

An additional weighted loss term is introduced, corresponding to the utility of information added after using candidate API calls. This loss term is meant to provide feedback for which API calls were useful for some given context.

Given the loss term and general strategy for inserting <API> tokens, the model is fine-tuned with the augmented dataset. At prediction time, the model uses a variant of greedy decoding, making API calls if the <API> tag is in the top-k predictions at any token position.
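As a rough illustration of this decoding loop (a minimal sketch; all function and token names here are ours, not from the paper's code), the tool-augmented generation might look like:

```python
def decode_with_tools(next_candidates, call_tool, prompt, top_k=3, max_len=20):
    """Sketch of Toolformer-style decoding: if the <API> token ranks in the
    model's top-k next-token candidates, execute the tool call and splice
    the result back into the sequence before continuing generation."""
    seq = list(prompt)
    for _ in range(max_len):
        ranked = next_candidates(seq)  # next tokens, ranked by probability
        if "<API>" in ranked[:top_k]:
            # Make the call and insert its result into the running sequence.
            seq += ["<API>", call_tool(seq), "</API>"]
        else:
            seq.append(ranked[0])  # ordinary greedy step
        if seq[-1] == "<EOS>":
            break
    return seq
```

The key point the sketch captures is that the API call is triggered by the model's own token predictions, not by any external controller.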

Professor Evans talked about how the method could benefit from having some “feedback” on the quality of the API’s responses, and from avoiding the implicit bias in the design that treats API calls as “costly”.

Interestingly, performance for some cases (ASDiv, Table 4) is better for the version with disabled API calls (so no agent-like behavior) than the variant equipped with API-returned information.

Scaling-law Experiments

  • For small model sizes, performance does not change much with the inclusion of external knowledge.
  • The utility of API calls is clearer for larger models, where performance drops significantly when API calls are disabled.

In terms of limitations, these tools cannot be used “in chain” (in an iterative-refinement approach, where multiple API calls are made), and the method requires sampling a lot of data.


Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023. PDF

Research on reasoning and acting has been detached from each other. This work allows LLMs to generate both reasoning traces and actions.

Learning is based on fine-tuning and prompting (the ReAct prompting strategy uses reasoning and action steps together in the prompt). The next few slides (below) walk through the different parts of ReAct via specific examples, showing how actions or reasoning in isolation are not sufficient for good agents.

Only when these two are combined together do we get powerful LLM agents:

Reasoning and acting together create an augmented action space, which is key to unlocking these models’ capabilities.

A Survey on Large Language Model based Autonomous Agents

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen. A Survey on Large Language Model based Autonomous Agents. arXiv, 2023. PDF.

The survey breaks down the agent construction pipeline into four components/modules: profiling, memory, planning, and action.


  • Handcrafted: captures the role of the agent properly and allows for flexibility, but is labor-intensive.
  • Using LLMs: starts with profile generation rules (which can be specified via few-shot examples) and controllable seeding for profiles.
  • Dataset Alignment Method: derives profiles from real-world datasets.

The profiling module is the foundation of agent design, and has significant influence on the following three modules.


Structures: Unified memory is short-term and simulates our “working memory” (added via context), while hybrid memory combines short-term and long-term memory to model human recollection better.

Formats: natural language is interpretable and flexible. Embeddings compromise on this flexibility, with the added benefit of being very efficient. Databases allow efficient manipulation of “memories”, and structured lists can also be used.

Operations: Memory reading allows for weighted retrieval of information, with operations for reading (memory reflection) and updating (memory writing) information.
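As an illustration of weighted memory reading, a minimal sketch (the scoring rule, combining relevance with a recency decay, is an assumption on our part, not code from the survey):

```python
import numpy as np

def retrieve(query_vec, memories, top_n=3, decay=0.99):
    # Weighted memory read: combine relevance (cosine similarity to the
    # query) with recency (exponential decay in the memory's age).
    scores = []
    for age, vec in memories:
        rel = float(np.dot(query_vec, vec) /
                    (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        scores.append(rel * decay ** age)
    # Return the indices of the top-scoring memories.
    order = np.argsort(scores)[::-1][:top_n]
    return [int(i) for i in order]
```

A relevant but old memory can thus lose out to a slightly less relevant recent one, which is the behavior memory-reflection schemes try to tune.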


Without feedback, planning may proceed via single-path reasoning (dependent, connected steps), multi-path reasoning (a tree-like structure that roughly approximates human thinking), or external planners (domain-specific planners).

Similarly, planning with feedback may rely on information from humans (e.g. RLHF), environmental feedback (e.g. RL for game simulation), or model feedback (using other pre-trained models).


  • Agents can have different targets: task completion, communication (communicate with other agents/humans), or exploration (explore vs. exploit tradeoff).
  • These actions may be produced via memory recollection (using short-term or hybrid memory), or following generated plans.
  • Their exploration space may include API calls, or internal knowledge.

Impact: These agents can directly change the environment (e.g., starting a calculator service), their own states (e.g., recollection), or trigger actions in other agents (e.g., a chatbot agent calling a legal-information agent).

Subbarao Kambhampati. Can LLMs Really Reason and Plan?. Communications of the ACM Blogpost, 2023.

The blogpost’s discussion uses Blocksworld as a benchmark. Blocksworld defines rules, goals, allowed actions, etc. via natural language, expecting a set of instructions (a plan) in response.

Performance seems pretty good with GPT-4 (left, ~35%), but when names are obfuscated (right), plan-generation results drop to 0-2%.

Professor Evans noted that the benchmarks are not a comparison with human performance, which would presumably also go down when names are obfuscated. It is thus unclear whether these drops in performance are expected (given that humans are bad at the modified task as well), or a result of the model not really “knowing” how to solve the given task. An alternate explanation is that the model simply has a hard time identifying entities labeled with nonsensical, multi-token strings that don’t reveal them to be blocks. That said, there is tons of data about Blocksworld in the training data, making it a difficult domain for testing what the model is really learning (if anything).

In-class Discussion

  • What are your thoughts on LLM reasoning/planning? We talked about how, in psychology, reasoning is divided into three domains (knowledge acquisition, reasoning, decision making). Even in that literature, it is unclear how these three domains interact with each other, which makes the question even more complicated for LLMs.
  • How should we proceed with this line of research? We acknowledged that it is difficult to define “planning” even for humans, and more so for LLMs. Professor Evans mentioned that for this line of work to advance, we need to come up with a good benchmark (but this is very labor-intensive). Students recalled work on performing activities in Minecraft as a useful benchmark for planning and agents. The “granularity” of planning is also nuanced: humans can infer steps in between (or use domain knowledge), which is harder for agents or “just LLMs”. At the same time, we do not have a good answer to “should we expect our model to behave more like a hard-coded program or like a human (whose performance changes due to new factors, e.g., semantic changes)?”

Wednesday, November 15: Applications of LLM Agents

The experimental setup comprises two restaurants, serving as competitive agents, and fourteen customers, acting as judge agents. To confine the action space of the Large Language Model (LLM), a management system is employed. This system functions as a question provider, formulating precise inquiries for the LLM to ensure that its output remains within the defined action space. The customers exercise their judgment when evaluating offers from both restaurants, ultimately selecting based on their individual constraints and requirements.

Data inputs: Daybook provides data regarding the previous day’s patronage, menu evaluation, and related insights. Likewise, Rival offers comparable information concerning the competitor’s restaurant, encompassing visitor statistics and menu alterations. Customer feedback is used to make decisions about the next day.

Discussion Notes:

  1. LLM scores can act as a baseline, but there is always a possibility of bias. For instance, changing the order of options presented to the model may sometimes result in a different score being outputted.
  2. Designing a model based solely off of customer/restaurant data fails to capture other experiences of dining (i.e. customer service, environment/ambience, etc.) and thus indicates the simulation’s low fidelity. Capturing decision-making factors in customers is especially difficult, as they are difficult to define and quantify. The current simulation does not account for customers’ risk-aversion for trying new dishes, and it also does not consider the influence of star ratings or reviews on customers’ decisions to choose between the two restaurants. There may also be issues with prompt-based tasks, such as over-generalization.
  3. Utilizing simulations has the potential for real-world social trends and phenomena to be reproduced without requiring a large number of real people or complex variables; it is not necessary to recreate an entire town in order to gain insights into real-world trends.

Agents are able to learn from each other while maintaining differentiation. This is visible in two ways:

  1. Agents imitate observed strategies that provide a high reward. For example, a restaurant may copy the popular dishes of another restaurant to compete for their clientele.
  2. Conversely, differentiation is used to attract patrons that the competing agents don’t specifically cater to; one restaurant may provide inexpensive food for customers on a budget while another provides fine-dining options.

The agents are shown to adapt to various customer needs in an effort to retain or attract further patronage.

A number of sociological and economic principles were demonstrated in the experiment.

  1. Is competition among agents the best mechanism to take advantage of their capabilities? What are the limitations of this approach?
  2. What other interactions are feasible?
  3. What are the benefits and risks and/or pros and cons of these interactions as compared to competition among agents?

Collaborative Approach Limitations: One potential drawback of adopting a collaborative approach is the propensity for bias in a single agent to propagate through multiple agents, thus amplifying its impact.

Employing Negotiation-Based Tasks and Games: In the context of collaborative endeavors, employing negotiation-based tasks and games is a valuable strategy. These involve the participation of diverse agents, such as a managerial figure, a chef, and a waiter, each representing distinct stakeholders. The amalgamation of their inputs contributes to a holistic decision-making process.

The Feasibility of Restaurant Collaboration: We explored the possibility of restaurants engaging in collaborative efforts, including the exchange of information regarding signature dishes, the potential collusion to elevate pricing structures collectively, and the coordination of operational hours. However, it is essential to consider potential drawbacks, particularly the willingness of competitors to engage in such cooperative ventures.


This work explored learning through collaboration via multiple types of interaction as shown in the next slide.

LTC Communication Patterns:

  1. Monologue: The agent trains by providing feedback to itself, playing the roles of both the actor and the instructor.
  2. Dialogue: As opposed to the previous approach, training is conducted with separate agents acting as the actor and the instructor.
  3. Analogue: Similar to the former approach, but training rewards and examples are provided by the instructor agent rather than by the environment.

The agent model is optimized with the trajectory data collected in the exploration phase. This relies on a multi-objective loss function composed of a standard loss for unsupervised language model training and a reinforcement objective that maximizes the expected reward from previous communication data. Beta acts as a balancing hyperparameter.
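Schematically (symbols assumed on our part, not copied from the paper), the combined objective is:

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{LM}}(\theta) \;+\; \beta \,\mathcal{L}_{\text{RL}}(\theta)
```

where $\mathcal{L}_{\text{LM}}$ is the standard unsupervised language-modeling loss on the collected trajectories, $\mathcal{L}_{\text{RL}}$ is the expected-reward objective over previous communication data, and $\beta$ balances the two terms.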

Readings and Discussion Questions

Monday 13 November: Introduction to LLM Agents



(Post response by Sunday, 12 November)

  1. What are the key methodologies or techniques used in the Toolformer paper, and how does the tool use of LLM differ from the existing use of LLM, e.g., prompting, demonstration, etc.?
  2. Which potential applications or industries could benefit (or suffer) the most from the LLM Agent concept? How might it revolutionize or impact these areas?
  3. Regarding Can LLMs Really Reason and Plan?, do you agree with the opinion that what LLMs really do is a form of universal approximate retrieval, which was sometimes mistakenly interpreted as reasoning capabilities? What is your perspective on this question?

Wednesday 15 November: Applications of LLM Agents



(Post response by Tuesday, 14 November)

  1. The CompeteAI: Understanding the Competition Behaviors in Large Language Model-based Agents paper shows that LLM agents can be used for simulating the competition environment. How might the competition behaviors observed in LLM-based agents translate to other real-world applications where strategic competition is critical? Essentially, are there specific characteristics unique to the restaurant setting that might not directly apply to other sectors?
  2. What are some considerations (ethical or otherwise) that may arise as a result of programming LLMs to compete with each other, especially considering the possibility of this being implemented in real world scenarios? If there are valid concerns, how could the models be calibrated to ensure that the competition remains ethical, preventing the agents from learning and adopting potentially harmful or deceptive strategies?
  3. Agents can be used in various ways. One way is to make them compete (like in the CompeteAI paper). Instead of competing, how can agents be used in other ways (e.g. by collaborating/communicating with each other), and how might this impact their behavior?
  4. Given the adaptive nature of LLM-based agents in a competitive environment, how can we ensure transparency and interpretability in the decision-making processes of these agents, so that stakeholders can understand and trust the outcomes of such simulations?

Week 11: Watermarking on Generative Models

Presenting Team: Tseganesh Beyene Kebede, Zihan Guan, Xindi Guo, Mengxuan Hu

Blogging Team: Ajwa Shahid, Caroline Gihlstorf, Changhong Yang, Hyeongjin Kim, Sarah Boyce

Monday, November 6: Watermarking LLM Outputs

Recent instances of AI-generated text passing for human text and the writing of students being misattributed to AI suggest the need for a tool to distinguish between human-written and AI-generated text. The presenters also noted that the increase in the amount of AI-generated text online is a risk for training future LLMs on this data.

A proposed solution is to embed a watermark in the output of text generation models.

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein. A Watermark for Large Language Models. 2023. [PDF]

Token-based watermarking: given a word in a sequence, token-based watermarking uses a hash function to seed a random number generator that partitions the set of all possible next words into two lists: the “green” word list and the “red” word list.

The algorithm from the paper uses the language model probabilities to separate words using a hash function-based random number generator.

The idea is that the more of the text’s words come from the green lists, the more likely the text is AI-generated:
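One way to operationalize this detection idea is a one-proportion z-test on the count of green-list tokens. Here is a minimal sketch (the seeding scheme is illustrative on our part, not the paper's exact hash function):

```python
import math
import random

def green_list(prev_token: int, vocab_size: int, frac: float = 0.5) -> set:
    # Seed an RNG from the previous token (a stand-in for the paper's hash),
    # then take a pseudorandom fraction of the vocabulary as the green list.
    rng = random.Random(prev_token)
    perm = list(range(vocab_size))
    rng.shuffle(perm)
    return set(perm[:int(frac * vocab_size)])

def detect_z(tokens, vocab_size, frac=0.5):
    # Count tokens that fall in the green list induced by their predecessor;
    # a large z-score means far more green tokens than chance predicts.
    hits = sum(tok in green_list(prev, vocab_size, frac)
               for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - frac * n) / math.sqrt(n * frac * (1 - frac))
```

Watermarked text, which systematically favors green tokens, yields a z-score far above what unwatermarked text would produce by chance.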

This approach is limited, however. The entropy of a particular token could determine how well the watermark works:

Soft Watermarking

Soft watermarking lessens the impact of the red list on low-entropy tokens (tokens that are almost certain to follow the current context) by encoding some flexibility in a “hardness parameter” δ added to the logits of green tokens:
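A sketch of the soft rule (shapes and names assumed; illustrative only): add δ to the green-list logits before the softmax, so a dominant low-entropy token still wins even when it is red, while high-entropy positions tilt green:

```python
import numpy as np

def soft_watermark_probs(logits, green_ids, delta=2.0):
    # Add the hardness parameter delta to every green-list logit,
    # then renormalize with a softmax.
    biased = np.asarray(logits, dtype=float).copy()
    biased[list(green_ids)] += delta
    e = np.exp(biased - biased.max())  # numerically stable softmax
    return e / e.sum()
```

With near-uniform logits, the green token becomes much more likely; with one strongly favored red token, that token still dominates, preserving text quality on low-entropy steps.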

With regard to search techniques for watermarked text, beam search improves performance:

The class then split into three groups to discuss the following questions:

  • Is watermarking unfair to us, especially in academic settings?
  • Who should have access to the detection tool? Should it be available to everyone?
  • What are your thoughts on artificial intelligence regulations? And do you believe/think we can/should tame AI’s power through stiff regulatory control?

Attacks on Watermarks

They then explain in more detail the impossibility of detection and the main intuition behind the trade-off:

The main intuition is that sentences run through a paraphrasing tool will not be detected as AI-generated, while sentences produced directly by the LLM may be. The output space of an LLM is more limited than that of a paraphraser, because the LLM tries to maintain the same meaning; as a result, the set of possible paraphrased sentences (PS) is larger than the set of LLM sentences (LS).

If LS grows to become as large as PS, this causes Type I errors, because it becomes increasingly hard to distinguish PS from LS.

If PS shrinks toward LS, this causes Type II errors, because it becomes increasingly hard to detect LS.

A discussion question was put forward in class: why do we consider this human-generated text when the human is using feedback from the model to create spoof attacks?

The class also discussed whether, in the case of misinformation, it even matters if the text is AI-generated: what matters is that it is marked as misinformation, not whether it was AI-generated or human-crafted.

Are there cases where we actually care about watermarks? One case is an AI tool writing and publishing a book. Perhaps the problem is the volume of generated text more than its content: it causes a loss to human creators and poses unbeatable competition in terms of speed. Detection matters more for the volume than for any single use.

Wednesday, November 8: Watermarking Diffusion Models

The topic for Wednesday is watermarking diffusion models.

Diffusion Model

Jonathan Ho, Ajay Jain, Pieter Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS 2020. [PDF]

But first, an overview of how diffusion models work in general.

How it works

Diffusion models generate images by removing some level of noise for every iteration


At every iteration, the model receives a noisy image and the current iteration number, and generates a less noisy image for the next iteration.

What is inside the denoise module?

Inside the model, there is a noise-prediction module that predicts the noise; the model then subtracts the predicted noise from the image.

Training the noise predictor

But then, how is the model trained?

Generating Training Data

By generating noised images as ground truth, mirroring the denoising process.

Training the Noise Predictor

At each iteration, add noise to the image.

Algorithm of the denoising diffusion probabilistic model

The loss function, at a high level, minimizes the difference between the true noise and the predicted noise.

There is a reparameterization trick that lets you generate the noised image for any iteration in one step.
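A sketch of the trick, using the standard DDPM notation where $\bar\alpha_t$ is the cumulative product of $1 - \beta_t$ (variable names here are ours):

```python
import numpy as np

def noise_in_one_step(x0, t, betas, rng=None):
    # Jump directly to the noised image at step t: instead of adding noise
    # t times, sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
    rng = rng or np.random.default_rng(0)
    alpha_bar = np.cumprod(1.0 - np.asarray(betas))[t]
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps  # eps is the training target for the noise predictor
```

During training, the noise predictor is asked to recover `eps` from `xt` and `t`, which is exactly the loss described above.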

The mathematical proof for the trick... (left as exercise for the reader)

The full proof

Maximum Likelihood Estimation

So the goal here is to maximize the likelihood of generating images from a similar distribution.

Computing p(x)

Denoising diffusion probabilistic models


Text-to-image generation works by conditioning on the prompt at every iteration.

Stable Diffusion


Some points that came up during discussion:

• A GAN maps a distribution to a distribution, whereas diffusion maps an image to a distribution.

• The size of z is different for each model.

• Diffusion models need more time to train.

• GANs are less stable to train; the gradient-management issues they face are not present in the other models.

• Diffusion models are more robust to noise and are more controllable.

• GANs can take in some contextual input, but diffusion models are more flexible because they can take in more context.

Watermarking Models

Yugeng Liu, Zheng Li, Michael Backes, Yun Shen, Yang Zhang. Watermarking Diffusion Model. 2023. [PDF]

Watermarking is familiar to us on images, but in general it is defined as proof of ownership, so that ideas and artifacts can't be used without authorization.

Stealing Models

Idea for Watermarking a Model

Here, watermarking occurs by embedding a specific behavior into the model.


Netflix can monitor other models and see if they have similar outputs by embedding an output that would be unlikely to occur normally.

Trigger words should not affect the rest of the sentence.


NaiveWM uses the trigger word to generate a new image that is very similar to the original.

Trigger length is the number of tokens (not the length of the word).


• There is sometimes a visible decrease in image quality with a watermark, so there is a tradeoff between quality and watermarking.

• There will always be an adversary trying to reverse the watermarking process (or we should at least assume so), so this field still needs growth and more proof of irreversibility.

Readings and Discussion Questions

Monday 6 November: Detectable Watermarks for LLMs



(Post response by Sunday, 5 November)

  1. In “A Watermark for Large Language Models”, how robust is the watermarking framework against potential adversarial attacks, and what might an adversary do to disrupt the watermark while preserving useful, quality text?
  2. The “A Watermark for Large Language Models” paper gives a list of properties a watermark should satisfy. Do you agree with this list of properties? Are there additional properties you think are important, or ones that they include that should be different?
  3. Do you see a future where watermarking can be useful and effective, even when there are adversaries with motivations to disrupt watermarks?
  4. Regarding watermarking and AI-generated text, what other methods or techniques do you believe could be investigated to strengthen the resistance of watermarked AI-generated text to paraphrase attacks?

Wednesday 8 November: Watermarking on Diffusion Models



(Post response by Tuesday, 7 November)

  1. After the development of diffusion models, they quickly replaced GANs in nearly all image generation applications. What are the biggest differences between diffusion models and GANs, and why have they been so successful?
  2. How are the required properties for watermarking a model similar and different from those for watermarking model outputs (like in Monday’s class)?
  3. In “Watermarking Diffusion Model”, the authors describe a clear threat model but don’t provide as clear a list of the required properties for a watermark as was in the “A Watermark for Large Language Models” paper. Can you provide a list of the required properties of a watermark that are implied by their threat model?

Week 10: Data Selection for LLMs

(see bottom for assigned readings and questions)

Presenting Team: Haolin Liu, Xueren Ge, Ji Hyun Kim, Stephanie Schoch
Blogging Team: Aparna Kishore, Elena Long, Erzhen Hu, Jingping Wan

Monday, 30 October:
Data Selection for Fine-tuning LLMs

Question: Would more models help?

We’ve discussed many risks and issues of GenAI so far, and it can be difficult to come up with possible solutions to these problems.

  • Would “using a larger model to verify a smaller model’s hallucinations” be a good idea?
  • One caveat: how can one ensure the larger model’s accuracy?

Question: Any potential applications of LLMs as a Judge?

As summarized from the Week 10 discussion channels, there could be many potential applications of LLMs as a Judge, such as assessing writing quality, checking for harmful content, judging whether a social media post is factually correct or biased, and evaluating whether code is optimal.

Let’s start from Paper 1: Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin et al. “Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.” arXiv preprint arXiv:2306.05685 (2023).

Multi-turn questions provide a different way to query GPT. In this paper, the authors pose multi-turn dialogues to two different assistants.

The paper provides two benchmarks:

  1. MT-bench provides a multi-turn question set that challenges chatbots with difficult questions.
  2. Chatbot Arena provides a crowdsourced battle platform that presents responses from two models for humans to choose between.

The results show that:

  • GPT-4 is the best judge, matching both controlled (MT-bench) and crowdsourced (Arena) human preferences.
  • Whether judging in pairs or singly, it exceeds 80% agreement with the human subjects.

This shows that strong LLMs can be a scalable and explainable way to approximate human preferences. Some advantages and disadvantages were introduced for LLM-as-a-Judge.


  • Scalability: Human evaluation is time-consuming and expensive
  • Consistency: Human judgment varies by individuals
  • Explainability: LLMs can be trained to evaluate and provide reasons for their judgments


  • Position Bias: Bias towards responses based on their position (preferring the first response)
  • Verbosity Bias: Favor longer, more verbose answers regardless of accuracy
  • Overfitting: LLMs might overfit their training data and fail to generalize to new, unseen responses

The above potential of fine-tuning an open-source model (though there could be interpretability problems) leads to a question: the paper motivates wanting to emulate the performance of stronger, closed-source models. Can we imitate these models?

Imitation Models

Paper 2: Gudibande, A., Wallace, E., Snell, C., Geng, X., Liu, H., Abbeel, P., Levine, S. and Song, D., 2023. The false promise of imitating proprietary LLMs. arXiv preprint arXiv:2305.15717.

To start the second paper, let’s do a class poll about which output people would prefer.

Most of the class chose Output A.

Here, let’s look at the definition of model imitation: “The premise of model imitation is that once a proprietary LM is made available via API, one can collect a dataset of API outputs and use it to fine-tune an open-source LM.”

The goal of model imitation is to imitate the capability of the proprietary LM.

This shows that broad imitation can be more difficult than local imitation: local imitation targets only specific tasks, while broad imitation requires gathering a diverse dataset, and the imitation model must capture that full distribution to produce similar outputs.

To build imitation datasets, there are two primary approaches:

  • Natural Examples: If you have a set of inputs in mind, (e.g. tweets about a specific topic), use them to query the target model.
  • Synthetic Examples: Prompt the target model to iteratively generate examples from the same distribution as an initial small seed set of inputs.

The paper used two curated datasets:

    1. Task-Specific Imitation Using Synthetic Data: NQ-Synthetic
    2. Broad Imitation Using Natural Data: ShareGPT-Mix

For NQ Synthetic:

For the ShareGPT-Mix dataset which includes three sources:

Note: the paper above presents the data using frequencies with too few data points, and the numbers in the table don’t seem to make sense (e.g., there is no way to get 31% out of 50 queries).

Using the two datasets, the authors look into two research questions —
How does model imitation improve as we:

  1. Scale the data: Increase the amount of imitation data (including fine-tuning with different sized data subsets)
  2. Scale the model: Vary the capabilities of the underlying base model (the number of parameters in the base model was used as a proxy for base-model quality, regardless of architecture)

Here’s the experiment setup:

The authors used Amazon Mechanical Turk workers (paid $15 an hour, on the assumption that fair pay helps the quality of annotations) to compare ChatGPT against the imitation models. They also used GPT-4 to evaluate both.

For the preference evaluation (figure below):

  • (Left graph) Humans preferred the imitation models’ responses at a rate equal to or above ChatGPT’s for over 85% of responses. However, adding more imitation data (the x-axis) does not close the gap.
  • (Middle graph) In fact, more imitation data can cause a decrease in accuracy.
  • (Right graph) A larger base model leads to better performance. The authors thus concluded that the best way to improve model capabilities is to improve the base model’s capabilities, rather than fine-tuning on more imitation data.

GPT-4’s findings are similar to the human preferences: more imitation data doesn’t close the gap, and a larger model contributes to better performance.

They found little improvement from increasing the amount of imitation data or the size of the imitation LM.

Question: Why is there a discrepancy between crowdworker (and GPT-4) preferences evaluation and automatic benchmark evaluation?

Authors’ conclusion: Imitation models can learn style, but not factuality.


Topic 1: What are the potential [benefits, risks, concerns, considerations] of imitation models?

  • Non-toxic outputs are a benefit
  • Imitation models have a better capability of matching a specific input-data style
  • One risk is that experimental results can be misleading, reflecting style imitation rather than the model’s intrinsic capabilities

Most people in our class agreed that there should be no legal concern for researchers and companies reverse-engineering other models via imitation. The professor also proposed that tech companies like OpenAI rely heavily on publicly accessible training data from the Internet, which weakens the argument that companies have the right to exclusive possession of the whole model.

False Promises Paper


  1. Fine-tuning alone can’t produce imitation models that are on par in terms of factuality with larger models.
  2. Better base models are the most promising direction for improving open-source models (e.g., architecture, pretraining data).

Detecting Pretraining Data from Large Language Models

Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer. Detecting Pretraining Data from Large Language Models. 25 Oct 2023.

Question: What if we knew more about the data the target models were pretrained on?

For detecting pretraining data (DPD) from LLMs, we divide the examples into two categories:

  • Seen example: Example seen during pretraining.
  • Unseen example: Example not seen during pretraining.

Here we introduce the WikiMIA benchmark.

Formalized definition of membership inference attack:

Note: this definition doesn’t really fit with what they do, which does depend on the assumption that the adversary has not only access to the distribution, but specific candidate documents (some of which are from the data set) for testing. There are many other definitions of membership inference attacks, see SoK: Let the Privacy Games Begin! A Unified Treatment of Data Inference Privacy in Machine Learning.

Min-K% Prob
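The method scores a candidate text by averaging the log-probabilities of its k% least likely tokens: texts seen during pretraining tend to contain fewer surprisingly low-probability tokens, so a higher score suggests membership. A minimal sketch of the scoring rule (function name is ours):

```python
def min_k_prob(token_logprobs, k=0.2):
    # Average the log-probabilities of the k% lowest-probability tokens;
    # a higher (less negative) score suggests the text was seen in training.
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n
```

Classification then thresholds this score, with the threshold chosen on a validation set such as WikiMIA.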

Results across different settings for Min-K% Prob, compared with baseline algorithms (PPL, Neighbor, etc.), show that as the model size or text length increases, detection becomes easier, and Min-K% Prob always has the best AUC.

DPD shows it is possible to identify if certain pretraining data was used, and touches on how some pretraining data is problematic (e.g. copyrighted material or personally identifiable information).

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, Ahmed Awadallah. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. pdf


Instruction tuning is a technique that allows pre-trained language models to learn from input (natural language descriptions of the task) and response pairs. The goal is to train the model to generate the correct output given a specific input and instruction.

{“instruction”: “Arrange the words in the given sentence to form a grammatically correct sentence.”, “input”: “the quickly brown fox jumped”, “output”: “the brown fox jumped quickly”}

Challenges with Existing Methods

“Model imitation is a false promise” since “broadly matching ChatGPT using purely imitation would require:

  1. a concerted effort to collect enormous imitation datasets
  2. far more diverse and higher quality imitation data than is currently available.”

Beyond this, existing methods face several challenges:

  • Simple instructions with limited diversity: an initial set of prompts is used to incite the LFM to produce new instructions; any low-quality or overly similar responses are then removed, and the remaining instructions are reintegrated into the task pool for further iterations.
  • Query complexity: existing methods are limited in their ability to handle complex queries that require sophisticated reasoning, because they rely on simple prompts that do not capture the full complexity of the task.
  • Data scaling: existing methods require large amounts of high-quality data to achieve good performance, but collecting such data is often difficult and time-consuming.

What is Orca

Orca is a 13-billion parameter model that learns to imitate the reasoning process of LFMs. It learns from rich signals from GPT-4, including explanation traces, step-by-step thought processes, and other complex instructions, guided by teacher assistance from ChatGPT.

Explanation Tuning

The authors augment ⟨query, response⟩ pairs with detailed responses from GPT-4 that explain the reasoning process of the teacher as it generates the response. These provide the student with additional signals for learning. They leverage system instructions (e.g., explain like I’m five, think step-by-step and justify your response, etc.) to elicit such explanations. This is in contrast to instruction tuning, which only uses the prompt and the LFM response for learning, providing little opportunity for mimicking the LFM’s “thought” process.

Large-scale training data with diverse tasks augmented with complex instructions and rich signals. ⟨ System message, User query, LFM response ⟩
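A training instance might look like the following (a hypothetical illustration of the ⟨system message, user query, LFM response⟩ triple; the contents are illustrative, not drawn from the paper's data):

```python
# Hypothetical Explanation-Tuning instance: the system message elicits a
# step-by-step explanation trace, which the student model learns to imitate.
example = {
    "system": "You are a helpful assistant. Think step-by-step and justify your response.",
    "user": "If a train travels 60 miles in 1.5 hours, what is its average speed?",
    "response": (
        "Average speed is distance divided by time. "
        "60 miles / 1.5 hours = 40 miles per hour. "
        "So the train's average speed is 40 mph."
    ),
}
assert set(example) == {"system", "user", "response"}
```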


Evaluation for Safety

The evaluation is composed of two parts:

  • Truthful Question Answering
  • Toxic Content Generation


Topic: Given sufficient styles and explanations in pre-training data, what risks have been resolved, what risks still exist, and what new risks may have emerged?

We think that imitation models should not be more biased, because base models are generally less biased given better pre-training data. Since sufficient styles are also present, the risk of leaning towards a specific style is also mitigated.

Wednesday, 1 Nov: Data Matters
Investigating the Impact of Data on Large Language Models


This class was on the impact of data on the large language models.

Information exists in diverse formats such as text, images, graphs, time-series data, and more. In this era of information overload, it is crucial to understand the effects of this data on language models. Research has predominantly concentrated on examining the influence of model and data sizes.


We are going to look at two important research questions using two research papers:

  1. What types of data are beneficial to the models? (LIMA: Less Is More for Alignment)
  2. What are the consequences of low-quality data for the models? (The Curse of Recursion: Training on Generated Data Makes Models Forget)

LIMA: Less Is More for Alignment

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy. LIMA: Less Is More for Alignment. pdf

Superficial Alignment Hypothesis — the model learns its knowledge and capabilities entirely during the pre-training phase, while alignment teaches the specific sub-distribution of formats or styles to be used when interacting with users.


The authors curated a diverse set of training prompts, selecting 750 top questions from community forums and complementing them with 250 manually crafted examples of prompts and responses. They prioritized quality, diversity, and uniformity in their selection process, all while utilizing a smaller dataset. To evaluate performance, LIMA was benchmarked against state-of-the-art language models using a set of 300 challenging test prompts.


Findings 1

The left side showcases the outcomes of a human-based evaluation, comparing LIMA’s performance across various model sizes, while the right side employs GPT-4 for evaluation. In the comparisons with the first two language models, LIMA emerges as the victor. Notably, LIMA secures a 19% win rate against GPT-4, even when GPT-4 itself is utilized as the evaluator.

Findings 2

They also evaluated LIMA’s performance in terms of out-of-distribution scenarios and robustness. LIMA safely addressed 80% of the sensitive prompts presented to it. Regarding responses to out-of-distribution inputs, LIMA performed exceptionally well.

Findings 3

In the concluding series of experiments, the research team delved into how the diversity of the test data influences LIMA.

They observed that the filtered Stack Exchange dataset exhibited both diversity and high-quality responses. In contrast, the unfiltered Stack Exchange dataset, while diverse, lacked in quality. On the other hand, wikiHow provided high-quality responses specifically to “how to” prompts. LIMA showcased impressive performance when dealing with datasets that were both highly diverse and of high quality.

The authors also explored the effects of data volume, with their findings illustrated in the right figure above. Interestingly, they noted that the quantity of data did not exert any noticeable impact on the quality of the generated content.

Personal Thoughts

This slide delves into the personal perspectives of the presenter on drawing parallels between Large Language Models (LLMs) and students, as well as evaluating the influence of LIMA on a student’s learning journey. The take-away message from the first paper was that quality and diversity of the data is more important for LLM than the quantity of the data.

The Curse of Recursion:
Training on Generated Data Makes Models Forget

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson. The Curse of Recursion: Training on Generated Data Makes Models Forget. pdf


The second paper aims to scrutinize the effects of employing training data generated by preceding versions of GPT, such as GPT-(n-1), GPT-(n-2), and so forth, in the training process of GPT-(n).

Model Collapse

This paper discusses model collapse, a degenerative process where the model no longer remembers the underlying true distribution. This mainly happens with data whose probability of occurrence is low.

What causes Model Collapse?


The two main causes of model collapse are: (i) statistical approximation error, which occurs because the number of samples is finite, and (ii) functional approximation error, which occurs because the function approximators are not expressive enough.

Mathematical formulation of model collapse


Here, the model collapse is expressed through mathematical terms, providing a comprehensive overview of the feedback mechanism involved in the learning process.

Initially, the data are presumed to be meticulously curated by humans, ensuring a pristine starting point. Following this, Model 0 undergoes training, and subsequently, data are sampled from it. At stage n in the process, the data collected from step n - 1 are incorporated into the overall dataset. This comprehensive dataset is then utilized to train Model n.

Ideally, when Monte Carlo sampling is employed to obtain data, the resultant dataset should statistically align closely with the original, assuming that the fitting and sampling procedures are executed flawlessly. This entire procedure mirrors the real-world scenario witnessed on the Internet, where data generated by models become increasingly prevalent and integrated into subsequent learning and development cycles. This creates a continuous feedback loop, where model-generated data are perpetually fed back into the system, influencing future iterations and their capabilities.
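This feedback loop can be sketched with a one-dimensional Gaussian (a minimal simulation under assumed settings, not the paper's exact setup): each generation fits a Gaussian to the previous generation's samples and then draws a fresh dataset from that fit.

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]  # "human-curated" data

stdevs = []
for generation in range(50):
    # Fit Model n to the current data (here, just a mean and stdev).
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    stdevs.append(sigma)
    # Model n+1 is trained only on data sampled from Model n.
    data = [random.gauss(mu, sigma) for _ in range(100)]

# With finite samples the fitted parameters follow a random walk, so the
# estimated distribution drifts away from the true N(0, 1) over
# generations; with pure resampling the tails are the first to go.
```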

Theoretical Analysis

The following four slides delve into a theoretical analysis of both discrete and continuous distributions.

Discrete Distribution: In scenarios where a discrete distribution is under consideration, events with low probabilities tend to degenerate when the sample size is finite. As the number of time steps advances towards infinity, the distribution converges towards a delta distribution. This phenomenon results in the preservation of only the events with high probabilities, while those less likely start to fade away.

Continuous Distribution: Assuming the initial distribution to be Gaussian, the distribution of $X_j^i$ is analyzed. The resulting distribution follows a variance-gamma distribution, showcasing a divergence from the initial Gaussian form. When $M_i = M$, the difference in variance exhibits linear growth with respect to $n$. This indicates that as we proceed through time steps or iterations, the divergence from the initial distribution becomes more pronounced. To quantify the exact extent of divergence between distributions, the Wasserstein-2 distance is employed. This measure provides a precise way to understand how far apart the distributions have moved from each other. The analysis reveals that in order to maintain a finite distance as measured by Wasserstein-2, $M_i$ must increase at a rate that is super-linear with respect to $n$.
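As background, for one-dimensional Gaussians the Wasserstein-2 distance has a simple closed form (a standard result, stated here for reference):

```latex
W_2^2\!\left(\mathcal{N}(\mu_1, \sigma_1^2),\ \mathcal{N}(\mu_2, \sigma_2^2)\right)
= (\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2
```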

Note: this result didn’t make sense to anyone in the seminar. It seems to contradict the intuition (the variance should go to zero as n increases instead of increasing). Either something is wrong with the set up, or this is measuring something different from what we expect.

Justification for Theoretical Analysis

The figure shows numerical experiments for a range of different sample sizes. We can see from the figure that when the number of data samples is small, the estimates of the mean and variance are not very accurate.

The following two slides provide steps to estimate the mean and variance of general distributions beyond the one-dimensional Gaussian. They derive a lower bound on the expected distance between the true distribution and the approximated distribution at step $n+1$, which represents the risk incurred by finite sampling.

The takeaway of the lower bound is that the sampling rate needs to increase superlinearly to make an accurate end distribution approximation.


To evaluate the performance of GMMs, we can visualize the progression of the GMM fitting process over time. From the figure, we can see that within 50 iterations of re-sampling, the underlying distribution is mis-perceived, and performance worsens over time.

As before, an autoencoder is trained on an original data source, which is later sampled. Figure 9 on the left shows an example of generated data. Again, over a number of generations, the representation bears very little resemblance to the original classes learned from the data. This implies that more generations lead to worse quality, as much more information is lost. We can see from Figure 8 (below) that, as with single-dimensional Gaussians, tails disappear over time and all of the density shifts towards the mean.

To assess model collapse in Large Language Models (LLMs), given the high computational cost of training, the authors opted to fine-tune an available pre-trained open-source LLM. In this context, as the generations progress, models tend to produce more of the sequences that the original model would produce with higher likelihood.

This phenomenon closely resembles observations made in the context of Variational Autoencoders (VAEs) and Gaussian Mixture Models (GMMs), where over successive generations, models begin to produce samples that are more likely according to the original model’s probabilities.

Interestingly, the generated data exhibits significantly longer tails, indicating that certain data points are generated that would never have been produced by the original model. These anomalies represent errors that accumulate as a result of the generational learning process.


Q1. The class discussed the purpose of generating synthetic data. It seems unnecessary to use synthetic data to train on styles, as this can be achieved easily based on the experiments. However, more complicated tasks that involve reasoning might require more human-in-the-loop validation of the correctness of the synthetic data.

Q2. The class discussion focuses on the significance of domain knowledge in the context of alignment tasks. When dealing with a high-quality domain-specific dataset that includes a golden standard, fine-tuning emerges as a preferable approach for alignment tasks. Conversely, Reinforcement Learning from Human Feedback (RLHF) seems to be more suitable for general use cases that aim to align with human preferences. Additionally, we explore the potential of leveraging answers obtained through chain-of-thought prompting for the purpose of fine-tuning.

Q3. The tail of a distribution typically contains rare events or extreme values that occur with very low probability. These events may be outliers, anomalies, or extreme observations that are unusual compared to the majority of data points. For example, in a financial dataset, the tail might represent extreme market fluctuations, such as a stock market crash.

Q4. This is an open-ended question. One possible reason is that while fine-tuning with generated data mix the original data with the generated data, this augmentation can introduce novel, synthetic examples that the model hasn’t seen in the original data. These new examples can extend the tail of the distribution by including previously unseen rare or extreme cases.

Readings and Discussions

Monday 30 October

Required Reading

Optional Readings

Discussion Questions

  1. In The False Promise of Imitating Proprietary LLMs, the authors attribute the discrepancy between crowdworker evaluations and NLP benchmarks to the ability of imitation models to mimic the style of larger LLMs, but not their factuality. Are the experiments and evaluations performed in the paper convincing enough to support this statement?

  2. In The False Promise of Imitating Proprietary LLMs, the authors suggest that fine-tuning is a process to help a model extract the knowledge learned during pretraining, and the introduction of new knowledge during fine-tuning may simply be training the model to guess or hallucinate answers. Do you agree or disagree with this idea? Why?

  3. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena expresses confidence in the effectiveness of strong LLMs as an evaluation method with high agreement with humans. Do you see potential applications for LLM-as-a-Judge beyond chat assistant evaluation, and if so, what are they?

  4. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena addresses several limitations of LLM-as-a-Judge, including position bias, verbosity bias, and limited reasoning ability. Beyond the limitations discussed in the paper, what other potential biases or shortcomings that might arise when using LLM-as-a-Judge? Are there any approaches or methods that could mitigate these limitations?

Wednesday 01 November

Required Readings

Discussion Questions

  1. Based on Figure 1 from the LIMA paper, using high quality examples to fine-tune LLMs outperforms DaVinci003 (with RLHF). What are the pros, cons, and usage scenarios of each method?

  2. There are lots of papers using LLMs to generate synthetic data for data augmentation and show improvement over multiple tasks, what do you think is an important factor when sampling synthetic data generated from LLMs?

  3. In The Curse of Recursion: Training on Generated Data Makes Models Forget, the authors discuss how training with generation data affects the approximation for the tail of the true data distribution. Namely, for training GMM/VAEs, the tail of the true distribution will disappear in the estimated distribution (Figure 8). On the other hand, when fine-tuning large language models with generation data, the estimator will have a much longer tail (Figure 10). What do you think causes such a difference? In the real world, what does the tail of a distribution represent and how do you think this phenomenon impacts practical applications?

  4. In The Curse of Recursion: Training on Generated Data Makes Models Forget, the authors rely on several assumptions to support their arguments. How strong are those assumptions, and do you think they limit the paper's applicability to broader contexts?

  1. Gudibande, A., Wallace, E., Snell, C., Geng, X., Liu, H., Abbeel, P., Levine, S. and Song, D., 2023. The false promise of imitating proprietary llms. arXiv preprint arXiv:2305.15717. ↩︎

Week 9: Interpretability

(see bottom for assigned readings and questions)

Presenting Team: Anshuman Suri, Jacob Christopher, Kasra Lekan, Kaylee Liu, My Dinh

Blogging Team: Hamza Khalid, Liu Zhe, Peng Wang, Sikun Guo, Yinhan He, Zhepei Wei

Monday, 23 October:
Interpretability: Overview, Limitations, & Challenges

Definition of Interpretability

  • Interpretability in the context of artificial intelligence (AI) and machine learning refers to the extent to which a model’s decisions, predictions, or internal workings can be understood and explained by humans. It’s the degree to which a model’s behavior can be made transparent, comprehensible, and meaningful to users, stakeholders, or domain experts.
  • In concept-based interpretability, the focus is on explaining the model’s decisions in terms of high-level concepts or features that make sense to humans. This approach seeks to relate model behavior to intuitive, domain-specific, or abstract concepts. For example, in a medical diagnosis model, concept-based interpretability might explain that a decision was made because certain symptoms or biomarkers were present.
  • Mechanistic-based interpretability aims to provide a detailed understanding of how the model makes decisions. This approach delves into the inner workings of the model, explaining the role of individual features, weights, and computations. For instance, in a deep learning model, mechanistic interpretability might involve explaining the contributions of specific layers in the decision process.

Why is interpretability important?

Interpretability is important because it builds trust, aids in debugging, and is essential in applications where the consequences of AI decisions are significant, for example:

  1. Learn how the model makes the decision.
  2. Analyze whether there are biases or shortcuts that a model is taking during application.
  3. When dealing with human-in-the-loop systems, interpretability enables humans to work collaboratively with AI, leveraging their complementary strengths to achieve better outcomes.

Salient Explainers

The main ideas of these approaches are summarized in the accompanying figure.

A saliency map example for an image of a bird, produced by the different methods introduced above. While the vanilla gradient method's output is noisy, the other methods “improve” the map visually, so we gradually see a ‘clearer’ pixel-level attribution that aligns with the human understanding of the concept ‘bird’. It is important to note that none of these methods was evaluated quantitatively. It is also worth considering how much the methods that incorporate the input directly are explaining the model rather than the input.

Saliency Map Demo


Example: given an image and text prediction provided by the model, a saliency map highlights the most important features or regions within the image that the model uses to make decisions.

These maps can help users understand the model's decision-making process, particularly in applications such as medical imaging or self-driving cars. To create the saliency map, randomly crop or mask parts of the image and compute the similarity between the cropped images and text. If the similarity value is positive, indicating the crop is closer to the query, it should be represented as a red region on the saliency map. Conversely, if the value is negative, it should be depicted as a blue region.
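The masking procedure just described can be sketched as follows (a minimal occlusion-based version; the toy scoring function stands in for a real image-text model like CLIP, and all names and settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def occlusion_saliency(image, score_fn, patch=4):
    """Mask each patch in turn and record how much the image-text
    similarity score drops. Large drops mean the patch mattered."""
    h, w = image.shape
    base = score_fn(image)
    sal = np.zeros_like(image, dtype=float)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            masked = image.copy()
            masked[i:i+patch, j:j+patch] = 0.0   # occlude this patch
            sal[i:i+patch, j:j+patch] = base - score_fn(masked)
    return sal  # positive = "red" region, negative = "blue" region

# Toy model: similarity is just the mean intensity of the top-left corner,
# so only that corner should show up as salient.
toy_score = lambda img: img[:4, :4].mean()
image = rng.random((8, 8))
saliency = occlusion_saliency(image, toy_score)
assert saliency[:4, :4].sum() > saliency[4:, 4:].sum()
```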

Limitations of Salient Explainers


Combining Salient Explainers with Generative AI?

  • Background: Diffusion models and Bayesian Flow Networks use gradient-based guidance to move generated images toward the training distribution.
  • Discussion: Salient explainers could help identify which characteristics of the training images are used, improving understanding of generative AI models (such as GANs and diffusion models). For example, they could illuminate the score function that pushes the noisy distribution toward a real image. For GANs, the discriminator could use saliency to identify the regions it relies on; for diffusion models, saliency could help explain why the image is steered toward certain features to make it realistic.

Attention Explainers

Attention plays a pivotal role in numerous applications, particularly in Natural Language Processing (NLP). For instance, the choice of a specific word, along with the assigned weight, signifies a correlation between that weight and the decision made.

In NLP, understanding which words or elements are emphasized, and to what extent (indicated by their respective weights), is crucial. This emphasis often sheds light on the correlation between the input data and the model's decision-making process. Attention-based explainers serve as valuable tools in dissecting and interpreting these correlations, offering insights into how decisions are derived in NLP models.

Limitations of Attention: Attention is not Explanation

Jain and Wallace’s Attention is not Explanation, NAACL 2019

Sarah Wiegreffe, Yuval Pinter. Attention is not not Explanation. EMNLP, 2019.

A recent study has cautioned against using attention weights to highlight input tokens “responsible for” model outputs and constructing just-so stories on this basis. The core argument of this work is that if alternative attention distributions exist that produce similar results to those obtained by the original model, then the original model’s attention scores cannot be reliably used to “faithfully” explain the model’s prediction.
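The core argument can be illustrated with a toy example (an assumed setup, not the paper's experiments): two very different attention distributions over the same value vectors can yield the same context vector, so the weights alone cannot single out the "responsible" tokens.

```python
import numpy as np

values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.5, 0.5]])          # per-token value vectors

attn_a = np.array([0.5, 0.5, 0.0])       # attends to tokens 0 and 1
attn_b = np.array([0.0, 0.0, 1.0])       # attends only to token 2

out_a = attn_a @ values                  # weighted context vector
out_b = attn_b @ values
assert np.allclose(out_a, out_b)         # same output, different "explanations"
```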

Further Discussion: Attention is not not Explanation

Visualization of heatmaps of the attention weights at different indices for the two methods introduced in the Attention is not not Explanation paper. The upper one corresponds to a model with trained attention weights, and the lower one to a model with uniform frozen attention weights.

Both figures show the average attention weights over the whole dataset. Clearly, the model with trained attention weights has a noisier heatmap, due to differences in text content, while the model with uniform frozen attention weights has a clearer pattern that depends only on the length of the text.

If attention was a necessary component for good performance, we would expect a large drop between the two rightmost columns (i.e. comparison of model with trained attention weights and frozen uniform attention weights). Somewhat surprisingly (although less surprising when one considers the evaluated tasks), for three of the classification tasks the attention layer appears to offer little to no improvement whatsoever. In these cases, the accuracies are near identical on 3 out of 5 datasets, and so attention plays no role in explanations if we don’t need it in our prediction.

The above table presents several insightful findings:

  • The baseline LSTM outperforms others, suggesting the significance of a particular set of attentions, underscoring their relevance in the process (these attentions are learned and preserved for the MLP model).
  • The trained MLP exhibits competitive performance, hinting that while the specific attentions might not be as crucial, the model is still capable of learning close attention weights, signifying its adaptability in this context.
  • Conversely, the Uniform model yields the poorest performance, emphasizing the substantial role of attention mechanisms in this scenario.

The evaluation of these results highlights a crucial aspect: the definition of an explanation. Although these attention weights potentially hold utility (as observed in various settings), their direct interpretation is not always straightforward.

The initial hypothesis (on the right) proposed that fluctuating attention confidence minimally affected the output. However, after employing pretrained attentions (on the left), it became evident that higher attention weights correspond to reduced variance. (In the visualization, the length of the bubble represents variance, with tighter bubbles indicating the ideal scenario. Colors denote positive/negative labels.)



Group 1: Human interpretability features are essential. For instance, in the application of AI in drug design, an AI model alters a molecule from a non-drug variant to its drug counterpart. Presently, the predominant approach involves creating a black-box model that transforms one molecule into its drug form, contributing to a lack of credibility in the process. For example, a doctor unfamiliar with deep learning cannot comprehend the inner workings of the model that facilitates the conversion of regular molecules into drugs, making it challenging to ensure the safety, efficacy, and trustworthiness of the AI-driven drug design without transparent and interpretable models.

Group 2: A general observation we’ve noted is that the requirements or metrics for Explainable AI (XAI) significantly depend on the intended purpose of the tool. For instance, the explanations provided for AI researchers and end users are likely to differ due to their distinct focuses. On the other hand, there is a risk of amplifying confirmation bias if we anticipate XAI to explain phenomena in alignment with human beliefs. To truly understand why a model performs effectively, it might be necessary to set aside human biases and preconceived notions, enabling an unbiased exploration of how the model achieves its performance.

Group 3: Currently, there are still many difficulties in using XAI tools in general. For example, methods such as LIME and SHAP require a lot of computation and don't work well on complex large models. Besides, we lack ground-truth explanations, so we don't know whether the learned explanations are useful or not. Our suggestions for solving these problems, as well as the issues mentioned by other groups, are: 1) approximating the complicated model with a simpler model, 2) building self-explainable models, 3) combining different metrics and XAI tools.

Towards Provably Useful XAI

The critical point here underscores the necessity of task-specific techniques, which may seem self-explanatory—how can general principles apply without a specific context? Even then, this assumption is not necessarily guaranteed.

One of the primary challenges between the current state of eXplainable Artificial Intelligence (XAI) and potentially valuable XAI methods is the absence of a method-task link. Ensuring the usability and reliability of an explanation in a particular task requires a deeper connection. This could either involve anchoring explanations in theory directly aligned with the task's requirements or creating task-inspired explanations and subsequently empirically evaluating them within the targeted application.

Wednesday, 25 October: Mechanistic Interpretability

Mechanistic Interpretability is the process of reverse-engineering neural networks into understandable computer programs. It is often motivated by the goal of ensuring a model's behavior is both predictable and safe (though it is unclear if it can ever achieve such a goal).

Introduction: Mechanistic interpretability vs concept-based interpretability

  • Mechanistic interpretability focuses on trying to understand the model's internal mechanics/working.
  • It involves tracing the process from input to output to make the high dimensional mathematics within the network more understandable to humans.
  • In contrast, concept-based interpretability uses human-understandable concepts and model structure to explain.
  • For instance, a concept-based interpretable network might use subnetworks or branches to determine the required output.

Softmax Linear Units (SoLU)

Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer El Showk, Nicholas Joseph, Nova DasSarma, Ben Mann, and others (Anthropic AI). Softmax Linear Units. Transformers Circuit Thread, 2022.

Superposition Hypothesis

  • Superposition hypothesis is the basis of the problem being tackled by the SoLU paper.
  • The general idea is that the networks we train simulate a much larger network in which each neuron is its own disentangled feature.
  • Neural network layers have more features than neurons as part of a “sparse coding” strategy learned by the network.
  • This means most neurons are polysemantic, responding to several (unrelated) features.
  • This hypothesis states that there is no basis in which individual neuron activations are interpretable since the (number of features) > (number of neurons).
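A toy illustration of the hypothesis (an assumed setup, not from the paper): more features than neurons can be stored along nearly-orthogonal directions, at the cost of polysemantic neurons and interference between features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Store 10 sparse features in only 4 neurons, one random unit direction
# per feature. Individual neurons then respond to several features.
n_features, n_neurons = 10, 4
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

x = np.zeros(n_features)
x[3] = 1.0                        # one active sparse feature
activation = x @ W                # compressed 4-dim representation

# Project back to read the features out: the active feature dominates,
# but interference from the other directions is nonzero.
readout = activation @ W.T
assert readout.argmax() == 3
```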

Solutions to Superposition

  • There are two solutions to superposition:
    1. Create models with less superposition (the focus of this presentation and paper)
    2. Find a way to understand representations with superposition
  • Representation is the vector space of a neural network’s activations.
  • Features are independently understandable components of a representation in order to make it easier to understand.
  • Non-privileged basis: such representations don't come with any “special basis”, making them difficult to understand. The model is not trying to optimize for these features.
  • Privileged basis: it is plausible for features to align with this basis.

SoLU vs GeLU

There are several ways SoLU is designed to reduce polysemanticity:
  • Lateral inhibition, approximate sparsity and superlinearity can be achieved by changing the MLP activation function.
  • Instead of sigmoid in GeLU, they use softmax, and drop the constant (1.7).
  • However, this led to a massive drop in performance, so they added an extra LayerNorm after SoLU.
  • The intuition was that it might fix issues with activation scale and improve optimization.
  • However, the authors admit that one reason for the performance improvement may be that the extra LayerNorm may allow superposition to be smuggled through in smaller activations.

SoLU Motivating Examples

When SoLU is applied on a vector of large and small values (4, 1, 4, 1), the large values will suppress smaller values. Large basis aligned vectors e.g. (4, 0, 0, 0) are preserved. A feature spread across many dimensions (1, 1, 1, 1) will be suppressed to a smaller magnitude.
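These examples can be checked directly with a minimal implementation of the activation (a sketch of SoLU(x) = x · softmax(x); the extra LayerNorm the paper adds afterwards is omitted):

```python
import numpy as np

def solu(x):
    """SoLU(x) = x * softmax(x)."""
    e = np.exp(x - np.max(x))          # numerically stable softmax
    return x * (e / e.sum())

# Mixed large/small values: the large entries suppress the small ones.
print(np.round(solu(np.array([4., 1., 4., 1.])), 3))   # ~[1.9, 0.02, 1.9, 0.02]
# A basis-aligned vector is mostly preserved.
print(np.round(solu(np.array([4., 0., 0., 0.])), 3))   # ~[3.8, 0, 0, 0]
# A feature spread across many dimensions is shrunk uniformly.
print(np.round(solu(np.array([1., 1., 1., 1.])), 3))   # [0.25, 0.25, 0.25, 0.25]
```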

Performance vs. Explainability

Although performance on overall tasks tends to align with the training set's general performance, it's important to note that this may not reveal shortcomings in specific tasks or areas. To ensure a comprehensive understanding, the researchers conducted various evaluations on representative tasks, corroborating the insights gained from the loss curves.

They assessed their model's performance on a variety of datasets, including Lambada, OpenBookQA, ARC, HellaSwag, MMLU, TriviaQA, and arithmetic datasets, and the results are displayed in Figure 2. The authors observed that, overall, there were similar performance levels between the baseline and SoLU models across different model sizes. However, they did notice notable differences in a couple of tasks. For example, the SoLU model performed better on arithmetic tasks, while the baseline model outperformed on TriviaQA. Nevertheless, there wasn't a consistent trend favoring one model over the other across most tasks.

Are SoLU Neurons Interpretable?

To check if a neuron is easy to understand at first glance, the researchers had people (some of them authors of the study) look at a set of text snippets. These snippets usually contained about 20 short paragraphs, and the focus was on words that the neuron put a large weight on. These important words were highlighted in various shades of red to show how much weight the neuron gave them. This made it easy for the evaluators to quickly go through the snippets and spot any common themes. You can see an example of what these evaluators saw in the figure. (Note that this experiment includes no control — humans are prone to finding patterns in randomness also.)

Interpretability of Neurons in SoLU vs Baseline Transformer

This shows the results of human experiments on interpretability of neurons in SoLU vs baseline transformer for various model sizes. The authors used transformers with different numbers of layers, from 1 to 64. The blue line shows the proportion of neurons in the baseline transformer that were marked as potentially having a clear interpretation across these different layer counts. The red line shows the same thing for the SoLU transformer. The green dot specifically represents the proportion of interpretable neurons in the 16-layer model that had an extra layer-norm but not SoLU. In general, for models with 1 to 40 layers, using SoLU increased the number of neurons that could be easily understood by about 25%. However, in the 64-layer model, the improvement was much smaller.

LayerNorm Complications

This figure shows the fraction of neurons inconsistent with the primary hypothesis. Generally, as the number of activating dataset examples increases, the fraction of inconsistent neurons decreases; after the extra layer normalization, the fraction of inconsistent neurons increases.

Class Activity: Identify Feature Mappings

We can observe some interpretable feature mappings from these highlighted patterns. For example, the orange neuron represents words of the form verb+ing, and the cyan neuron represents words with the prefix 'sen'. The purple highlights are random, and no one hallucinated an interpretation for them.


Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, and others (Anthropic AI). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread, 2023.

The authors use a weak dictionary learning algorithm called a sparse autoencoder to generate learned features from a trained model that offer a more monosemantic unit of analysis than the model's neurons themselves.

Architectural Limitations

The model framework of the SoLU paper has an architectural limitation. Its activation function was designed to make fewer neurons activate, making the model more interpretable, but this pushed the model's sparsity too far, which encouraged the remaining neurons to become polysemantic. Here, a neuron is polysemantic if it represents more than one interpretable feature mapping.

Model Overview

The purpose of this paper is to clearly demonstrate the effectiveness of a sparse autoencoder in achieving two main objectives: extracting understandable features from superposition and facilitating basic circuit analysis. Specifically, the authors achieve this by using a one-layer transformer with a 512-neuron MLP (Multi-Layer Perceptron) layer. They break down the activations of the MLP into features that are relatively easy to interpret by training sparse autoencoders on the MLP activations obtained from a massive dataset comprising 8 billion data points. These autoencoders have varying expansion factors, ranging from 1× (resulting in 512 features) to 256× (resulting in 131,072 features).

Features as a Decomposition


The authors decompose the activation vector as $x^j \approx b + \sum_i f_i(x^j) d_i$, a combination of more general features that can point in any direction. In the equation, $x^j$ is the activation vector for datapoint $j$, $f_i(x^j)$ is the activation of feature $i$, each $d_i$ is a unit vector in activation space called the direction of feature $i$, and $b$ is the bias.

The Criteria for a Good Decomposition

This shows an example of a "good" feature decomposition. The criteria are:

  1. We can interpret the conditions under which each feature is active. In the example, feature 4 is activated by the appearance of positive words such as {'Hello', ..., "How's it going"}.
  2. We can interpret the downstream effects of each feature, i.e., the effect of changing the value of the feature on subsequent layers. This should be consistent with the interpretation in (1).
In this example, when the activation value of feature 4 increases, the text's negative sentiment should decrease, because the text is more likely to use positive words such as 'Hello' and "How's it going".

Sparse Autoencoders

Sparse autoencoders use several techniques and strategies to extract interpretable features from neural network activations:

  1. MSE Loss for Avoiding Polysemanticity: Emphasizes using Mean Squared Error (MSE) loss instead of cross-entropy loss to prevent the overloading of features.
  2. Larger Internal Dimension: Advocates for a larger number of internal features within the autoencoder to create an overcomplete set of features.
  3. L1 Penalty: Applies an L1 penalty on the internal features to encourage sparsity within the representation.
  4. Input Bias: Introduces an approach of adding an input bias to the representations in autoencoders, which demonstrates a significant boost in performance for the models used in toy examples.

The purpose of sparse autoencoders is to extract meaningful features from neural network activations. Achieving a good decomposition requires that the extracted features be interpretable: they should describe the activations, explain the downstream effects of changes, and cover a significant portion of the functionality within the data.
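The four ingredients above can be sketched in a few lines. This is a minimal forward pass of such a sparse autoencoder with hypothetical sizes and coefficients (the real training setup, initialization, and resampling details differ):

```python
import numpy as np

rng = np.random.default_rng(0)

d_mlp, d_hidden = 512, 2048  # 4x expansion factor (hypothetical sizes)
W_enc = rng.normal(0, 0.02, (d_mlp, d_hidden))
W_dec = rng.normal(0, 0.02, (d_hidden, d_mlp))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions
b_enc = np.zeros(d_hidden)
b_dec = np.zeros(d_mlp)  # input bias, subtracted before encoding

def sae_forward(x, l1_coeff=1e-3):
    # encode: subtract the input bias, project up, ReLU
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # decode: sparse combination of unit-norm dictionary directions
    x_hat = f @ W_dec + b_dec
    # MSE reconstruction loss plus an L1 penalty encouraging sparsity
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()
    return x_hat, f, loss

x = rng.normal(size=d_mlp)
x_hat, f, loss = sae_forward(x)
```

The overcomplete hidden layer (2048 > 512) gives the model room to assign one direction per feature, while the L1 penalty keeps only a few features active per input.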

Are these features “interpretable”?

Feature Activation Sampling Bias: In previous evaluations, there was a bias due to just considering the top-activation neurons which might inaccurately appear monosemantic due to their higher activations. To mitigate this bias, the approach involves sampling uniformly across all possible activations for each given feature.

Evaluation of Interpretable Features: The authors used an evaluation process where human-based assessments are used to determine the interpretability of the features extracted. The criteria for interpretability are based on the authors’ distribution-based evaluation, where a score above eight is considered sufficiently interpretable.

Automated Evaluation

The authors used a larger language model to generate a text description of each feature. For the features from the sparse autoencoder, there is a higher correlation with human interpretations, with average correlation values reaching as high as 0.153 in some instances, and up to 0.7 in larger models.

Group Discussions

The sparse autoencoder technique in focus can explain up to 80% of the loss. This means that by replacing activated neurons with reconstructions, 80% of the original model’s loss can be accounted for without altering the model. Notably, there is a high correlation (Spearman correlation around 0.7) between independent features of two models sharing the same architecture but having different random initializations.

Considering these evaluation findings, the class divided into three groups to discuss specific questions related to the interpretability of features.

One group noted a common discrepancy between the expectations from language models and humans. Language models are often expected to perform at superhuman or domain expert levels, while their capabilities might align more closely with those of a standard human. The use of a general-purpose language model for features requiring domain expertise was seen as a potential issue, as the model’s interpretation might lack the required domain-specific knowledge.

The group also shared their discussion about the possibility that language models might ‘hallucinate’ interpretations for given features, possibly creating false correlations or interpretations that do not necessarily exist within the data. Human evaluators might also introduce subconscious biases or look for patterns without having the necessary domain expertise, potentially affecting the interpretability findings. Another key point they raised was about the intended audience for interpretability. They discussed that interpretability work from the language models might primarily target researchers, specifically computer science researchers who are involved in developing and assessing models, rather than end-users of these models in practical applications.

The second group highlighted the multifaceted nature of the variance observed in models designed to find interpretable features. It primarily attributed this variability to stochastic elements in the learning process, the order and sequence of data during training, and the diverse interpretations resulting from these variations, which may lead to equally interpretable yet different feature sets.

The final group emphasized the potential acceptability of the unexplained 20% in certain contexts, underscoring the value in correctly interpreting the majority of the content. Additionally, they noted the potential nuances within the unexplained portion, distinguishing between varying reasons for lack of interpretability within that portion.

Feature Splitting

The authors describe training three versions of the autoencoder with increasing sizes of the internal representation, leading to sparser, more interpretable features. They analogize dictionary learning to recovering an unbounded number of interpretable features: even with varied model semantics, a structured superposition of concepts emerges in the learning process.

Through feature clustering and splitting, a single concept or feature might bifurcate into multiple similar but distinct interpretable features, yielding more fine-grained interpretations. These findings may have benefits beyond one-layer transformers, suggesting the possibility of applying this technique to larger models.


The summary underscores both the potential and the limitations of architectural changes aimed at controlling polysemanticity, as well as the promise of post-learning techniques, which have so far been demonstrated only partially and only for a simple one-layer transformer.

Post-learning interpretation seems more practical for now than adapting existing training techniques for interpretability, which would require larger changes to current practices.

Readings and Discussions

Monday 23 October

Required Reading

Optional Readings

Discussion Questions

  1. Chaszczewicz highlights shared challenges in XAI development across different data types (i.e. image, textual, graph data) and explanation units (i.e. saliency, attention, graph-type explainers). What are some potential similarities or differences in addressing these issues?

  2. In cases where models produce accurate results but lack transparency, should the lack of explainability be a significant concern? How should organizations/developers balance the tradeoffs between explainability and accuracy?

  3. How could XAI tools be used to improve adversarial attacks?

  4. In Attention is not not Explanation, the authors dispute a previous paper’s definition of explanation. Whose view do you find most convincing and why?

Wednesday 25 October

Required Readings

Discussion Questions

  1. (Softmax Linear Units) Elhage et al. present the Superposition Hypothesis which argues that networks attempt to learn more features than the number of neurons in the networks. By delegating multiple features to a single node, interpreting the significance of the node becomes challenging. Do you believe this hypothesis based upon their explanation, or do you suspect there is some separate obstacle here, such as the counter-argument that nodes could represent standalone features that are difficult to infer but often obvious once discovered?

  2. (Softmax Linear Units) Do you see any difference between SLU and ELU coupled with batch-norm/layer-norm? How does this relate to the reasons the LLM community shifted from ReLU (or variants like ELU) to GeLU?

  3. (Towards Monosemanticity) Could the identification of these “interpretable” features enable training (via distillation, or other ways) smaller models that still preserve interpretability?

  4. (Towards Monosemanticity) Toying around with visualization seems to show a good identification of relevant positive tokens for concepts, but negative concepts do not seem to be very insightful. Try the explorer out for a few concepts and see if these observations align with what you see. What do you think might be happening here? Can it possibly be solved by changing the auto-encoder training pipeline, or possibly by involving structural changes like SLU? Are there other interesting elements or patterns you see?

Week 8: Machine Translation

(see bottom for assigned readings and questions)

Machine Translation (Week 8)

Presenting Team: Ajwa Shahid, Caroline Gihlstorf, Changhong Yang, Hyeongjin Kim, Sarah Boyce

Blogging Team: Xindi Guo, Mengxuan Hu, Tseganesh Beyene Kebede, Zihan Guan

Monday, 16 Oct:
Diving into the History of Machine Translation

Let’s kick off this topic with an activity that involves translating an English sentence into a language of your choice and subsequently composing pseudocode to describe the process.

Here is an example of pseudocode from the activity:

Sentence = "The students like to read interesting books."
# A small sample bilingual English-to-Chinese dictionary
Eng_chinese_dict = {"students": "学生", "like": "喜欢", "read": "读", "books": "书"}
Translation = []
for word in Sentence.split():
    if word in Eng_chinese_dict:
        Translation.append(Eng_chinese_dict[word])
    else:
        Translation.append(word)  # keep words with no dictionary entry
Translated_sentence = " ".join(Translation)

After the activity discussion, here are the challenges encountered when translating from English to another language:

  • Variations in Word Order: Different languages have varying word orders, affecting sentence structure.
  • Idiomatic Expressions: Idioms and phrases may lack direct equivalents in other languages, requiring creative translation.
  • Plurality and Gender: Managing plural forms and gender variations can be complex across languages.
  • Verb Conjugations: Verbs change for different tenses and moods, complicating translation.
  • Subject-Object-Verb Order: Sentence structure differences influence how subjects, objects, and verbs are translated.
  • Verb Tense: Addressing past, present, and future tenses accurately in translation is crucial.

The Early Days of Machine Translation

The birth of machine translation can be traced back to 1933 with the work of George Artsrouni and Petr Smirnov-Troyanskii. Artsrouni developed an automatic multilingual word lookup, which can be viewed as a precursor to the modern digital dictionary. Smirnov-Troyanskii conceptualized a machine translation process where humans and machines work together, with different steps encoded by humans and others by the machine.

Different Generations of Machine Translation Approaches

Machine translation approaches can be categorized into three primary generations, each having its unique characteristics and methodologies.

First Generation: Direct Translation

The MT system is designed in all details specifically for one particular pair of languages, e.g. Russian as the language of the original texts, the source language, and English as the language of the translated texts, the target language. Translation is direct from the source language (SL) text to the target language (TL) text. Typically, systems consist of a large bilingual dictionary and a single monolithic program for analyzing and generating texts; such ‘direct translation’ systems are necessarily bilingual and unidirectional.

Second Generation: Interlingua and Transfer Approach

The Interlingua approach proposed the use of a universal semantic representation, known as Interlingua, between the source and target languages. This approach offered the advantage of being multilingual, where it was easy to swap source languages, thus reducing the number of models needed. In contrast, the Transfer approach used abstract representations of both the source and target languages. This approach required an extensive collection of dictionaries, grammar rules, and language structure rules, including syntax, morphology, and possibly semantics.

Third Generation: Statistical Methods

The third generation of MT brought a significant shift by introducing statistical methods. This generation marked a transition from rule-based methods to learning-based methods, leveraging data and probability to guide translation.

A Timeline of Machine Translation Evolution

In the 1950s, theory-driven machine translation began, heavily influenced by linguistics and computer science, with a primary focus on Russian-English and English-Russian translation and the use of some statistical methods.

The 1960s introduced challenges in representing semantics and the “encoder-decoder” structure, a foundational concept in modern neural machine translation.

In the 1970s, the field transitioned from interlingua to transfer methods and explored AI concepts.

The 1980s brought multilingual MT systems, revisiting interlingua methods and advancing semantic representations.

The 1990s witnessed a significant shift from rule-based methods to “corpus-based” and “example-based” approaches, emphasizing data-driven methods in machine translation. It also marked the initial use of neural networks and the integration of machine translation with speech recognition, opening new horizons in the field.

Neural Machine Translation

Neural Machine Translation (NMT) uses neural network models to develop a statistical model for machine translation. Unlike traditional phrase-based translation systems that comprise multiple small sub-components tuned separately, NMT attempts to build and train a singular, large neural network that reads a sentence and outputs the correct translation.

Early Models

Recurrent Neural Network (RNN)

RNNs process sequential data one element at a time, maintaining a hidden state that captures information from previous elements to inform future predictions or classifications. In the image above, ‘x’ represents individual elements from sequential data, ‘A’ represents the Neural Network, and ‘h’ signifies the hidden layer. At each time step, it processes sequential data step-by-step.

The image above illustrates how an RNN works for language translation. To translate a sentence like ‘How are you?’ from English to French using an RNN Encoder-Decoder, the English input is first encoded through an Encoder. Subsequently, the Decoder generates the French output word by word. This sequential process results in the French translation, ‘Comment allez-vous ?’, from the English input.

However, RNNs have a limitation in handling long-term dependencies. When dealing with a large amount of data, RNNs may struggle to memorize it all.
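The basic recurrence described above can be sketched in a few lines; this is a minimal vanilla RNN step with hypothetical parameter shapes, not any specific model:

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, b):
    # one RNN step: the new hidden state mixes the current input
    # with the previous hidden state through a tanh nonlinearity
    return np.tanh(Wxh @ x + Whh @ h_prev + b)

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
Wxh = rng.normal(0, 0.1, (d_hid, d_in))
Whh = rng.normal(0, 0.1, (d_hid, d_hid))
b = np.zeros(d_hid)

h = np.zeros(d_hid)
for x in rng.normal(size=(3, d_in)):  # process a 3-step sequence
    h = rnn_step(x, h, Wxh, Whh, b)
```

Because information from early steps must survive repeated passes through `Whh` and `tanh`, gradients shrink over long sequences, which is the long-term dependency problem noted above.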

One solution to this limitation is Long Short Term Memory networks (LSTM), a specific type of RNN designed to capture long-term dependencies. LSTMs incorporate forget, input, and output gates that regulate the flow of information within the network:

  • The forget gate determines whether to retain or discard certain information.
  • The input gate quantifies the significance of new information introduced.
  • The output gate determines what information should be produced.
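The three gates can be sketched as a single step function. This is a minimal LSTM cell with hypothetical parameter shapes, intended only to make the gating concrete:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold stacked parameters for the
    forget (f), input (i), output (o), and candidate (g) gates."""
    z = W @ x + U @ h_prev + b      # shape (4*d,)
    d = h_prev.shape[0]
    f = sigmoid(z[0*d:1*d])         # forget gate: retain or discard old memory
    i = sigmoid(z[1*d:2*d])         # input gate: weight of new information
    o = sigmoid(z[2*d:3*d])         # output gate: what to expose
    g = np.tanh(z[3*d:4*d])         # candidate cell update
    c = f * c_prev + i * g          # new cell state
    h = o * np.tanh(c)              # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
W = rng.normal(0, 0.1, (4*d_hid, d_in))
U = rng.normal(0, 0.1, (4*d_hid, d_hid))
b = np.zeros(4*d_hid)
h, c = lstm_cell(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), W, U, b)
```

The additive cell-state update `c = f * c_prev + i * g` is what lets gradients flow across many steps, unlike the purely multiplicative recurrence of a vanilla RNN.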

Simultaneously, the attention mechanism stands as a pivotal development in NMT. It grants the decoder access to all hidden states from the encoder, enabling the network to selectively focus on different segments of the input sequence during each step of output generation. This enhances the model’s ability to handle long sentences and complex dependencies. The model gains access to all inputs through the use of bidirectional RNNs. Additionally, the prediction of the next word depends on the weighted combination of these hidden states.

Transformer Model

The Transformer model is a type of neural network architecture primarily used in the field of natural language processing.

Key Advantage: Unlike models that process sentences sequentially, a Transformer processes the whole sentence at once, which can lead to faster computation and the ability to parallelize the process.

In the self-attention mechanism, a given batched input X is linearly projected into three distinct representations: Query (Q), Key (K), and Value (V). The attention score is computed using Q and K. If padding is applied, all padding positions are masked in the attention score.

Multi-Head Attention: In a Transformer model, Q, K, and V are divided into multiple splits, and each split is passed into a separate head, termed a “multi-head” attention system. Each head computes its attention independently, and all attention results are then concatenated, effectively reversing the split operation. The multi-head attention mechanism enables each head to learn different aspects of the meanings of each word in relation to other words in the sequence. This allows the Transformer model to capture richer interpretations of the sequence.
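The project-split-attend-concatenate pipeline described above can be sketched as follows (a minimal, unmasked sketch with hypothetical sizes; real implementations add masking, dropout, and batching):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model). Project to Q, K, V, split into heads,
    apply scaled dot-product attention per head, concatenate, project out."""
    seq, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    def split(M):  # (seq, d_model) -> (n_heads, seq, d_head)
        return M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
    out = softmax(scores) @ Vh                              # (n_heads, seq, d_head)
    concat = out.transpose(1, 0, 2).reshape(seq, d_model)   # undo the split
    return concat @ Wo

rng = np.random.default_rng(0)
seq, d_model, n_heads = 5, 32, 4
Wq, Wk, Wv, Wo = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(4))
Y = multi_head_attention(rng.normal(size=(seq, d_model)), Wq, Wk, Wv, Wo, n_heads)
```

Each head sees only a `d_head`-dimensional slice of the projections, so different heads are free to attend to different relationships in the sequence.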

Attention Examples

In conclusion, the field of Machine Translation has witnessed remarkable progress, with each new model or mechanism contributing to better language understanding and translation. Currently, Transformer models stand at the forefront of this field, but ongoing research promises further advancements.

Wednesday, 18 Oct:
Challenges of Machine Translation

Six Challenges of Neural Machine Translation

Philipp Koehn and Rebecca Knowles. Six Challenges for Neural Machine Translation. First Workshop on Neural Machine Translation, 2017. [PDF]

Key Insights:

In-domain performance: When trained and tested on the same domain, NMT and SMT systems generally perform well, with Medical and IT domains showing particularly high scores.

Out-of-domain performance: NMT systems tend to degrade more compared to SMT when applied to domains they were not trained on. For example, an NMT trained on Law performs relatively poorly in the Medical domain, while the SMT retains some effectiveness.

Key Insights:

General Trend: For all three models, as the amount of training data increases, the BLEU score (indicative of translation quality) also increases. This suggests that having more training data generally leads to better translation performance.

Comparative Performance: The Phrase-Based with Big LM model consistently outperforms the other two models across different corpus sizes. The Neural model (NMT) starts off better than the simple Phrase-Based model for smaller corpus sizes but tends to converge with the Phrase-Based model as the corpus size grows.

NMT provides superior performance when translating words that are infrequent (rarely used) or completely untrained (words not seen during the training process). This could suggest that NMT is more adaptable and flexible in handling diverse vocabulary than SMT.

While NMT excels at translating shorter sentences, SMT appears to be more effective for much longer sentences. This might indicate that the statistical approach of SMT is better equipped to handle the complexities and nuances of longer sentence structures.

From this data, it can be inferred that the choice of word alignment method can influence the accuracy and quality of translations. Different language pairs also exhibit varying levels of alignment match and probability, suggesting that the effectiveness of these methods can vary based on the languages in question.

Beam search is a heuristic search strategy that systematically expands the most promising nodes in a tree-like structure to improve sequence prediction results. The visual representation effectively demonstrates the branching and exploration inherent in the beam search algorithm.
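A generic beam search over cumulative log-probabilities might be sketched as follows (the step function here is a toy with made-up probabilities; in NMT it would be the decoder's next-token distribution):

```python
import math

def beam_search(start, step_fn, beam_width=3, max_len=5):
    """step_fn(seq) returns a list of (token, logprob) continuations;
    keep the beam_width best partial sequences by cumulative log-prob."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_fn(seq):
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)  # best first
        beams = candidates[:beam_width]                    # prune to beam width
    return beams

# toy step function over a 3-token vocabulary (hypothetical probabilities)
def toy_step(seq):
    return [("a", math.log(0.5)), ("b", math.log(0.3)), ("c", math.log(0.2))]

best = beam_search("<s>", toy_step, beam_width=2, max_len=3)
```

With beam width 1 this reduces to greedy decoding; widening the beam trades computation for the chance of recovering a globally better sequence that a greedy first step would discard.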

While all these issues are crucial, most of the class considered “Domain adaptation/mismatch” and “Amount of training data” to be the most pressing. As the digital world grows and diversifies, machine translation tools will be exposed to an ever-increasing array of content types. Ensuring that these tools can adapt to various domains and are trained on representative datasets will be key to their effectiveness and relevance.

Domain adaptation in NMT focuses on ensuring translation accuracy for specialized content, like medical or legal texts, given the varied data distribution between training and target domains. Additionally, the infrequent occurrence of specific words, notably proper nouns, in training data can result in mistranslations. Overcoming these hurdles is crucial for enhancing NMT’s accuracy and applicability in different contexts.

While both LLM and NMT are rooted in deep learning and can comprehend linguistic context, they differ in their primary objectives, training datasets, and architecture. LLMs are versatile and can handle diverse language tasks beyond just translation, while NMTs are specialized for translating between two languages.

The comparison highlights the translation performance debate between LLMs and MT models. While both commercial and open-source MTs are valuable, LLMs fine-tuned with general-purpose instructions often excel. The data emphasizes the significance of scale and fine-tuning in LLM effectiveness.

The diagrams depict the intricate processes by which Large Language Models (LLMs) engage in translation tasks:

Content Source: The initial content that needs translation can come in various formats, including PDFs, Office documents, and even video.

Automated TM Management: This is an automated system for managing ‘Translation Memories’ (TM). Translation memories store previously translated segments of text to ensure consistent translations and speed up the translation process in future tasks.

Pre-translation: The untranslated content is matched against a Vector Database (DB) to identify similar contexts or content. The system then uses the K-nearest neighbours method to reference the closest matches. Relevant training data is extracted to fine-tune the model for a specific task or context. The model undergoes Parameter-Efficient Fine-Tuning (PEFT) using the LoRA method, enhancing its precision for specific tasks or contexts.
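The vector-DB matching step might be sketched as a simple cosine-similarity retrieval (toy vectors and labels below are illustrative; a production system would use learned sentence embeddings and an approximate nearest-neighbour index):

```python
import numpy as np

def knn_retrieve(query_vec, memory_vecs, memory_texts, k=3):
    """Return the k translation-memory entries most similar to the
    query, ranked by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    M = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    sims = M @ q                       # cosine similarity to each entry
    top = np.argsort(sims)[::-1][:k]   # indices of the k best matches
    return [(memory_texts[i], float(sims[i])) for i in top]

# toy translation memory: 2-d "embeddings" paired with stored segments
memory_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
memory_texts = ["Hello!", "Goodbye.", "Hello and goodbye."]
matches = knn_retrieve(np.array([1.0, 0.0]), memory_vecs, memory_texts, k=2)
```

The retrieved segments can then be placed into the prompt as context, which is the sense in which past translations guide (rather than merely replace) the current translation.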

Hyper-personalized MT engine: Specific prompts are crafted for the content and its context. The model learns in the given prompt’s context, further enhancing translation accuracy. The LLM API allows other systems or processes to interact seamlessly with the LLM. The LLM works in tandem with the LoRA module, adding an extra layer of functionality and precision to the translation process.

Human Interaction: Even with advanced models, there’s a phase where human experts intervene, either for post-editing or proofreading, ensuring the final content adheres to the highest standards.

Top Quality MT Output: After all these stages, the translation process culminates in producing content of the highest caliber, ensuring accuracy and context preservation.

Both diagrams underscore a blend of automation, advanced modeling, and human expertise to achieve top-notch translations.

Database of Translated Texts: The cluster of pages symbolizes a collection of previously translated documents stored for reference.

Search Process: The magnifying glass indicates the process of searching within the database to find a matching or similar translation.

User Interaction: The silhouette represents a user or translator interacting with the translation memory system.

Original Content (A): The page with the letter “A” signifies source content awaiting translation. The highlighted segments on this page denote the parts of the text that have matching translations in the memory.

Translated Content (X): The page with the symbol “X” showcases the result after using the translation memory. The highlighted segments indicate the portions of the content retrieved from the memory, ensuring consistency and saving time.

Savings/Cost-Efficiency: The stack of coins symbolizes the financial advantage or savings gained from using translation memory, reaffirming the caption that states “Translation memory is the customer’s moneybox.”

This visual displays how Translation Memory systems improve efficiency and consistency in translation tasks by reusing previously translated segments.

The diagram illustrates three fine-tuning methodologies for neural networks:

Classic: Involves iterative corrections based on errors.

Freeze: Retains the original weights and makes separate, task-specific adjustments.

LoRA (Low Rank Adaptation): Directly integrates with pre-trained models for efficient task-specific adaptations without extensive error corrections.

In essence, while the classic method emphasizes error corrections, the freeze approach preserves foundational knowledge, and LoRA offers a streamlined adaptation process. The choice among them hinges on the task and desired model refinement.
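The LoRA idea can be sketched in a few lines: freeze the pre-trained weight and learn only a low-rank additive update (hypothetical sizes; real setups apply this to attention projections inside a transformer):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4  # hypothetical layer sizes and rank

W_frozen = rng.normal(0, 0.02, (d_out, d_in))  # pre-trained weight, never updated
A = rng.normal(0, 0.02, (rank, d_in))          # trainable low-rank factor
B = np.zeros((d_out, rank))                    # zero-initialized: no change at start

def lora_forward(x, alpha=8):
    # effective weight is W_frozen + (alpha/rank) * B @ A, but the
    # full-size update is never materialized during training
    return W_frozen @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
y = lora_forward(x)
# at initialization the LoRA branch contributes nothing
assert np.allclose(y, W_frozen @ x)
```

Only `A` and `B` are trained, so the number of task-specific parameters is `rank * (d_in + d_out)` instead of `d_in * d_out`, which is what makes per-domain or per-customer adaptation cheap.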

Benefits related to machine translation include:

  • Stylized: LLMs can adapt to different translation styles, such as formal, informal, or even regional variations.
  • Interactive: They can provide real-time translations during interactive sessions, such as live chats.
  • TM based: LLMs can utilize Translation Memory to ensure consistency across large documents or series of documents, improving translation quality by leveraging prior translations.

Evolution of MT Techniques: Traditional Machine Translation has evolved to incorporate more sophisticated methods, with the advent of GPT models introducing new paradigms.

Stylized MT: Goes beyond standard translation by adapting to specific styles or tones, like literature or marketing. Despite Neural Machine Translation (NMT) having capabilities for style transfer, its potential is often limited by data availability. LLMs like GPT can overcome this using zero-shot prompts.

Interactive MT: Represents a shift towards more user-centric translation methods. By actively involving the user in the translation process and gathering their feedback, it ensures translations are more contextually accurate and meet specific requirements.

Translation Memory-based MT: Aims to improve translation efficiency by referencing past translations. LLMs bring an added advantage by using past translations not just for replication but as context-rich prompts to guide the current translation process.

Emergence of New Paradigms & Concerns: The integration of LLMs in MT introduces new evaluation methodologies. However, this also raises potential privacy concerns, emphasizing the importance of data ethics in AI-driven translation.

Multi-modality in MT: Suggests a future direction where translations aren’t just based on text but incorporate multiple forms of data, enriching the translation process.

Dependency on LLMs: The consistent reference to GPT models across various MT applications indicates the growing influence and reliance on LLMs in modern translation efforts.

LLMs play a transformative role in reshaping machine translation, offering enhanced accuracy, versatility, and user engagement.

The activity showcases the challenges and complexities of machine translation. While tools like DeepL can offer rapid translations, they may sometimes miss cultural, contextual, or idiomatic nuances present in human translations. This is especially relevant for movie dialogues, where context and tone play crucial roles. Can MT outperform human translation? MT can be faster and more efficient for large-scale tasks, but human translation excels at capturing nuance, cultural context, and idiomatic expressions.

What are other challenges that can’t be solved by MT? MT struggles with cultural nuances, idioms, historical context, and emotional undertones which humans can naturally grasp.

How can HT and MT interact for better Language Translation? A hybrid approach, combining MT’s speed and efficiency with HT’s contextual understanding, can lead to more accurate and nuanced translations.

Monday, 16 October

Required readings:

Shereen A. Mohamed, Ashraf A. Elsayed, Y. F. Hassan and Mohamed A. Abdou. Neural machine translation: past, present, and future. Neural Computing and Applications, 2021.

Optional readings:

W. John Hutchins. Machine Translation: A Brief History. From Concise history of the language sciences: from the Sumerians to the cognitivists (edited by E. F. K. Koerner and R. E. Asher). Pergamon Press, 1995. [PDF]

Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. A Statistical Approach to Machine Translation. Computational Linguistics 1990. [PDF]

Linfeng Song, Daniel Gildea, Yue Zhang, Zhiguo Wang, Jinsong Su. Semantic Neural Machine Translation Using AMR. Transactions of the Association for Computational Linguistics, 2019. [PDF]

Discussion Questions

(post your response by Sunday, 15 October)

  1. What are the limitations of the testing benchmarks used for machine translation (as described in Neural machine translation: past, present, and future) and how might these limitations impact model development?

  2. The paper describes neural machine translation (NMT) models as simpler than previously utilized statistical machine translation (SMT) models, and lists a few ways in which this is the case. Are there any drawbacks to NMT models over SMT models, particularly when it comes to interpretability and assuring that essential linguistic knowledge is learned?

  3. Why do most LLMs use decoder-only architecture? Why not encoder-decoder?

Wednesday, 18 October

Required readings

Chenyang Lyu, Jitao Xu, Longyue Wang. New Trends in Machine Translation using Large Language Models: Case Examples with ChatGPT. [PDF]

Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, Orhan Firat. The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation. [PDF]

Optional readings

Philipp Koehn and Rebecca Knowles. Six Challenges for Neural Machine Translation. First Workshop on Neural Machine Translation, 2017. [PDF]

Vivek Iyer, Pinzhen Chen, Alexandra Birch. Towards Effective Disambiguation for Machine Translation with Large Language Models.

Danielle Saunders. Domain adaptation for Neural Machine Translation. PhD Dissertation, University of Cambridge, February 2021. [PDF]

Radhika Sharma, Pragya Katyayan, Nisheeth Joshi. Improving the Quality of Neural Machine Translation Through Proper Translation of Name Entities. [PDF]

Verna Dankers, Christopher Lucas, Ivan Titov. Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation. ACL 2022. [PDF]

Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, James Henderson. Document-Level Neural Machine Translation with Hierarchical Attention Networks. EMNLP 2018. [PDF]

Discussion Questions

  1. Six Challenges for Neural Machine Translation describes six neural machine translation challenges. Discuss how you have encountered these challenges in real-world translator use, the risks you anticipate, and how to mitigate them.

  2. There have been many attempts to evaluate the performance of machine translation. Do you agree with the evaluation method in the paper The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation? What other factors do you think are important in how we evaluate?

Week 7: GANs and DeepFakes

(see bottom for assigned readings and questions)

Presenting Team: Aparna Kishore, Elena Long, Erzhen Hu, Jingping Wan

Blogging Team: Haochen Liu, Haolin Liu, Ji Hyun Kim, Stephanie Schoch, Xueren Ge

Monday, 9 October:
Generative Adversarial Networks and DeepFakes

Today's topic is how to utilize generative adversarial networks to create fake images and how to identify the images generated by these models.

A Generative Adversarial Network (GAN) is a deep learning framework that pits two neural networks against each other in a creative showdown. One network, the generator, strives to produce realistic data, such as images or text, while the other, the discriminator, aims to differentiate between genuine and generated data. Through a continuous feedback loop, GANs refine their abilities, leading to the generation of increasingly convincing and high-quality content.

To ensure students had a better understanding of GANs, the leading team held a “GAN Auction Game” to simulate the generating and discriminating roles of the generator and discriminator in a GAN. In this game, students were divided into two groups (Group 1 and Group 2). Group 1 provided three items (e.g., the name of a place) while Group 2 tried to identify whether the items provided were real or fake.

The game captures the training process of GANs where the generator first proposes certain contents (e.g. images or contexts) and the discriminator is trained to distinguish real from generated (fake) content.

If the generator successfully creates content that fools the discriminator, it receives a high reward for further tuning. On the other hand, if the discriminator correctly identifies the content created by the generator, it receives a reward.

This iterative training process is illustrated by the figures below.
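The alternating updates can be sketched with a minimal numpy example on a 1-D toy problem. The toy data distribution, linear generator, and logistic discriminator here are illustrative choices for exposition, not details from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# Toy setup: real data ~ N(4, 0.5); generator g(z) = a*z + b;
# discriminator D(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0   # generator parameters
w, c = 0.1, 0.0   # discriminator parameters
lr, batch = 0.05, 64

for _ in range(2000):
    real = rng.normal(4.0, 0.5, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b

    # Discriminator step: gradient ascent on log D(real) + log(1 - D(fake))
    s_r, s_f = sigmoid(w * real + c), sigmoid(w * fake + c)
    w += lr * (np.mean((1 - s_r) * real) - np.mean(s_f * fake))
    c += lr * (np.mean(1 - s_r) - np.mean(s_f))

    # Generator step: gradient ascent on log D(fake) (the non-saturating variant)
    s_f = sigmoid(w * fake + c)
    a += lr * np.mean((1 - s_f) * w * z)
    b += lr * np.mean((1 - s_f) * w)

print(f"mean of generated samples is roughly {b:.2f} (real data mean is 4.0)")
```

With these settings the generator's mean drifts toward the data mean; real GANs use deep networks and stochastic-gradient optimizers, but the alternating structure is the same.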

Formally, the training process can be modeled as a two-player zero-sum game by performing min-max optimization on the objective function. A Nash equilibrium is established between the generator and discriminator.
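Concretely, this min-max objective is the standard GAN formulation from Goodfellow et al. (2014), in which the discriminator D is trained to maximize V while the generator G is trained to minimize it:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

At the equilibrium, the generator's distribution matches the data distribution and the optimal discriminator outputs D(x) = 1/2 everywhere.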

For a system that only has a generator and a discriminator, it is hard to tell whether they are doing well, because there are many bad local optima. Thus, one direct remedy is to introduce human feedback for evaluation.

For example, we can borrow strategies from Large Language Models (LLMs), particularly employing Reinforcement Learning from Human Feedback (RLHF). In this method, experts would iteratively rank the generated samples, offering direct reinforcement signals to improve the generator’s output. This approach could enhance the realism and semantic alignment of the content created by GANs. However, the RLHF method has its drawbacks, primarily the extensive need for expert involvement, raising concerns about its scalability in larger evaluations.

An alternative could be the inclusion of non-expert users, offering a broader range of feedback. Crowdsourcing and user studies are suggested as methods to understand if the generated content meets the target audience’s needs and preferences.

For images or tabular data, when the data distribution is roughly known, the inception score serves as a useful metric. This score computes the KL divergence between the conditional class distribution and the marginal class distribution of generated samples. A higher inception score (IS) indicates clearer and more diverse images. However, it does not always correlate with human judgment.
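As a minimal sketch (assuming the per-sample class probabilities have already been produced by a pretrained classifier such as Inception-v3, which is what the full metric uses), the score can be computed as:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, C) array of p(y|x) for N generated samples over C classes."""
    p_y = probs.mean(axis=0)  # marginal class distribution over all samples
    # Per-sample KL( p(y|x) || p(y) ), then IS = exp of the mean KL.
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident, diverse predictions score high; indistinct predictions score ~1.
print(inception_score(np.eye(4)))              # four samples, each a distinct class
print(inception_score(np.full((8, 4), 0.25)))  # uniform predictions
```

The first call yields a score near the number of classes (4), the second a score near 1, matching the intuition that IS rewards samples that are individually clear and collectively diverse.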

Despite their successes, GANs are notoriously difficult to train. Two common failure modes are:

  1. Vanishing/Exploding Gradient: During backpropagation, gradients can shrink (vanish) or grow excessively (explode), disrupting learning. Vanishing gradients stall the network's learning, as parameter updates become negligible. Exploding gradients cause extreme, destabilizing updates, hindering the model's convergence.

  2. Mode Collapse: GANs can suffer from mode collapse, where the generator produces limited, similar samples, failing to represent the data's true diversity. This occurs when the generator exploits the discriminator's weaknesses, concentrating on certain data aspects and neglecting others. It compromises the GAN's objective of generating diverse, realistic samples, indicating a breakdown in adversarial learning.

A warm-up game is to identify the fake person that is generated by GANs.

In the above figure, one of the two faces is fake but it is difficult to identify at first glance.

To successfully identify fake images, there are several methods that either use deep-learning-based models trained to recognize fake samples or rely on direct observation by people. The leading team then introduced three interesting methods that enable us to tell the difference. We will revisit these two faces later and now focus on the detailed methods for general identification.

For example, images generated by GANs tend to contain color artifacts or invisible artifacts that can be identified by deep learning models.

The second method is physical-based: the corneal specular highlights in the two eyes of a real face are strongly similar, while those of GAN-generated faces differ.

The third method is physiological-based. Specifically, the pupils for the real eyes have strong circular shapes while the GAN-generated pupils usually have irregular shapes.

With the help of these methods, we can say that the woman on the left in the figure shown earlier is fake. This is justified by the color artifacts identified by the deep learning model and by her irregular pupils.

The leading team also believes that these identification methods can be evaded by more advanced image-generating models, but new methods will be proposed accordingly to distinguish images generated by those models. Generation and identification will evolve together.

In summary, generative models such as GANs have fundamentally transformed people's lives, and there remains a substantial amount of future research and development ahead. Some future directions are listed above.

Wednesday, 11 October
Creation and Detection of DeepFake Videos


  1. Introduction to deepfake videos
  2. Detecting Face-swap deepfakes with temporal dynamics
  3. Discussion

Definition of a deepfake: A deceptive image or recording that distorts reality to deceive.

There are some side effects of face swap methods, including

  • Limited accuracy
  • Concerns of how to protect privacy

The presenters introduced three different methods of generating deepfake videos:

  1. Reenactment
  2. Lip-sync deepfakes
  3. Text-based deepfake synthesis

Reenactment: A deepfake reenacts using source images to manipulate the target.

Example of a reenactment: the mouth movement in Trump's video is animated by a source actor.

Here is another example of Reenactment, where the dancing in target video is animated by a source actor.

Three main steps of reenactment:

  1. The first step is tracking facial features in both source and target videos.
  2. A consistency measure aligns input video features with a 3D face model.
  3. Expressions are transferred from source to target with refinement for realism.

Difference between face swap and reenactment:

  • The difference concerns the target image: face swap retains part of the source image in the target, while reenactment transfers only the source actor's motion and does not retain the source image's background

Most methods use RGB images, while lip-sync relies on audio input.

  • Audio is transformed into a dynamic mouth shape.
  • The mouth texture is matched with the target video for natural motion.

Text-based methods modify videos word by word; phonemes and visemes are key for pronunciation and analysis. Text edits are matched with phoneme sequences in the source video, and parameters of the 3D head model are used to smooth lip motions.

While previous works have accomplished a lot, an overlooked aspect in the creation of these deep-fake videos is the human ear. Here is one recent work that tackles this problem from the perspective of the ear.

Three types of authentication techniques:

  • Forensic Analysis
  • Digital Signatures
  • Digital Watermarks

Today, our focus is on forensic methods to detect deep fakes. These methods can be categorized into low- and high-level approaches.

The presenters asked the class to find criteria to identify an authentic picture of Tom Cruise. In the discussion, several factors were highlighted:

  • Posture of the third picture is unnatural.
  • Foggy background of the first picture vs. realistic background of the second picture
  • Scale/ratio of head and hand is odd in the third picture

During the class poll to determine which image appeared authentic, the majority of students voted for the second image, with a few supporting the first, and none voting for the third.

Surprisingly to the majority of the class, it was revealed that the first image was genuine, while the others were crafted by a TikTok user who creates deep fake content.

Many deep fake videos emphasizing facial expressions often neglect the intricate movements of the human ear and the corresponding changes that occur in jaw movements.

The aural dynamics system tracks and annotates ear landmarks, utilizing averaged local aural motion to simulate both horizontal and vertical movements, mirroring those of a real person.

With the videos of Joe Biden, Angela Merkel, Donald Trump, and Mark Zuckerberg, they used GAN to synthesize the mouth region of individuals to match the new audio track and generate a lip-sync video.

The graphs are a distribution of the correlation of horizontal motion of three aural areas and audio (left) and lip vertical distance (right).

Fake videos show no correlation, whereas real individuals show strong correlations, though these are not necessarily consistent from person to person.
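The correlation analysis can be sketched as follows. The signals here are synthetic stand-ins; the paper's actual features come from tracked ear landmarks and the audio track:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length 1-D signals."""
    x, y = x - x.mean(), y - y.mean()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

t = np.linspace(0, 2 * np.pi, 1000)
audio_envelope = np.sin(3 * t)                                 # stand-in for the audio signal
real_ear_motion = 0.4 * np.sin(3 * t) + 0.05 * np.cos(7 * t)   # motion tracking the audio
fake_ear_motion = np.cos(5 * t)                                # unrelated synthesized motion

print(pearson(audio_envelope, real_ear_motion))  # strong positive correlation
print(pearson(audio_envelope, fake_ear_motion))  # near zero
```

A detector can then threshold the correlation: a synthesized mouth region leaves the ear motion decoupled from the audio, so its correlation stays near zero.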

The horizontal movement of the tragus and lobule parts of Trump’s ears exhibited a positive correlation, distinguishing it as a distinctive personal trait, unlike the general pattern observed on others.

The table shows the performance of each model. Models with person-specific training show a higher average testing accuracy.

Question 1: Limitations of the proposed methods & possible improvements

  • Group 1: Evaluation on high-quality videos is poor, since the training dataset is low-quality.
  • Group 2: Ear detection is only possible when ear is visible.
  • Group 3: Dependent on visibility of the ear and on having a reference image; limited ability to generalize to other situations (e.g., smaller sample sizes); an attacker could find an actor whose biometric markers are more similar to the desired image.

As mentioned, there are drawbacks: hair can hide the movement of the ears, large head movements interfere, and accurate ear tracking is difficult. Still, more facial and audio signals can be studied further.

Question 2: Anomalies found from deep fake videos

  • Group 2:
    • The shape of the mouth is generally the same; it just expands and shrinks to mimic mouth movements when speaking, so the bottom teeth are never shown.
    • Light reflection on glasses when reporters move their head is not generated.
    • Lips not synced properly (the word editing does not match).
  • Group 3:
    • Fake video 1:
      • Lips seem constrained
      • Eye blinking is robotic, with no change in width
      • Lack of nostril changes
    • Fake video 2:
      • Mouth/teeth
      • Symmetry

The speaker in the first video does not blink for over 6 seconds, which is highly implausible: the average resting blinking rate is about 0.283 blinks per second.

The second speaker’s lips are not closing for ‘m,’ ‘b,’ and ‘p’ (Phonemes-Visemes).

Human pulse and respiratory motions are imperceptible to the human eye. Amplifying these factors could serve as a method for detecting generated videos.

Note: The method was originally designed for medical purposes, aiming to identify potential health risks in a medical setting in a non-intrusive way.

  • Group 1: The “arms race” is a win-win development for both groups; generation and detection will learn from each other's feedback.
  • Group 2: There will always be cases where humans get creative and find ways to evade detection, as happens with fraud detection. Optimally, if people don't use it unethically, it can be useful in many ways, such as in the film industry.
  • Group 3: What if big companies become attackers (even for research purposes)?

Both the technology and the ways to detect deep-fake videos will continue to advance. However, it requires more than simply trying to generate and identify them. By using watermarks, deep-fake videos can be distinguished from source material. Furthermore, public education on the importance of collecting information from credible sources, as well as further government regulation, can be considered. Perhaps the biggest threat from improvements in the quality and ease of creating fake imagery is that people will lose confidence in all images and assume everything they see is fake.


For the first class (10/9)

For the second class (10/11)

Optional Additional Readings

On GAN Training

Blogs and Tutorials

Discussion Questions

For Monday’s class: (as usual, post your response to at least one of these questions, or respond to someone else’s response, or post anything you want that is interesting and relevant, before 8:29pm on Sunday, 8 October)

  1. How might the application of GANs extend beyond image generation to other domains, such as text, finance, healthcare (or any other domain that you can think of) and what unique challenges might arise in these different domains? How can the GAN framework ensure fairness, accountability, and transparency in these applications?
  2. Considering the challenges in evaluating the performance and quality of GANs, how might evaluation metrics or methods be developed to assess the quality, diversity, and realism of the samples generated by GANs in a more robust and reliable manner? Additionally, how might these evaluation methods account for different types of data (e.g., images, text, tabular etc.) and various application domains?
  3. The authors identify 2 methods of detecting GAN-based images: physical and physiological. Is it possible that we can train a new model to modify a GAN-based image to hide these seemingly obvious flaws, like the reflection and pupil shapes? Will this approach quickly invalidate these two methods?
  4. Do you agree with the authors that deep-learning based methods lack interpretability? Are the visible or invisible patterns detected by DL models really not understandable or explainable?

Questions for Wednesday’s class: (post response by 8:29pm on Tuesday, 10 October)

  1. What are the potential applications for the techniques discussed in the Agarwal and Farid paper beyond deep-fake detection, such as in voice recognition or speaker authentication systems?
  2. How robust are the proposed ear analysis methods to real-world conditions like different head poses, lighting, occlusion by hair?
  3. What are your ideas for other ways to detect deepfakes?
  4. Deepfake detection and generation seems similar to many other “arms races” between attackers and defenders. How do you see this arms race evolving? Will there be an endpoint with one side clearly winning?