We provide signatures of the functions that you have to implement. Make sure you follow the defined signatures; otherwise, your coding solutions will not be graded.
Please submit a single Jupyter Notebook file, where only Python and Markdown/$\LaTeX$ are used. Any hand-written solutions inserted as photos or in any other way are prohibited and will not be graded. If you have any questions about using Markdown, ask them!
The works will be checked for plagiarism. The score will be divided by the number of similar works.
Some tasks in this assignment are bonus tasks. This means that you can get the maximum score for this assignment even if you ignore them. However, if you get $y$ points for bonus tasks and $x$ points for regular tasks, then your score for the assignment will be $\min(90, x+y)$. If $x + y > 90$, we will label your submission and take it into account when releasing the final grades (e.g. in borderline cases).
Our goal is to write an orthogonalization layer that, for an arbitrary rectangular matrix $A$, produces an orthonormal matrix $Q$ and (optionally) an upper triangular matrix $R$ such that $A = QR$. In other words, we need to compute the QR decomposition.
There are problems where one needs to optimize over orthogonal matrices, and such a layer can be a solution.
A standard algorithm to compute the QR decomposition is the Gram-Schmidt process. For a set of linearly independent vectors $v_1,\dots,v_n$ (the columns of the matrix $A$) it computes a matrix $Q$ that contains the orthonormalized set of vectors $q_1,\dots,q_n$.
This algorithm is listed below.
$$ \begin{split} u_1 &= v_1, \\ u_2 &= v_2 - \frac{(v_2, u_1)}{(u_1, u_1)} u_1, \\ \dots \\ u_n &= v_n - \frac{(v_n, u_1)}{(u_1, u_1)} u_1 - \frac{(v_n, u_2)}{(u_2, u_2)} u_2 - \dots - \frac{(v_n, u_{n-1})}{(u_{n-1}, u_{n-1})} u_{n-1}. \end{split} $$The obtained $u_1, \dots, u_n$ are orthogonal vectors in exact arithmetic. Then, to make the system orthonormal, divide each vector by its norm: $u_i := u_i/\|u_i\|$.
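For example, for $v_1 = (1, 1)^T$ and $v_2 = (1, 0)^T$ the process gives
$$ u_1 = (1, 1)^T, \qquad u_2 = (1, 0)^T - \frac{1}{2}(1, 1)^T = \left(\tfrac{1}{2}, -\tfrac{1}{2}\right)^T, \qquad q_1 = \tfrac{1}{\sqrt{2}}(1, 1)^T, \quad q_2 = \tfrac{1}{\sqrt{2}}(1, -1)^T. $$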
We have $A = QR$. The matrix $Q$ consists of the normalized vectors $u_1, \dots, u_n$. Now let's look at the matrix $R$.
Since $Q^*A = R$, we get $R_{i, j} = (u_i, v_j)$ (assuming $u_i$ is already normalized), which is exactly what we see in the Gram-Schmidt algorithm:
$$ v_n = \frac{(v_n, u_1)}{(u_1, u_1)} u_1 + \frac{(v_n, u_2)}{(u_2, u_2)} u_2 + \dots + \frac{(v_n, u_{n-1})}{(u_{n-1}, u_{n-1})} u_{n-1} + u_n.$$This is the representation of the vectors $v_j$ in our newly computed orthonormal basis; the coefficients form the matrix $R$. They lie in the upper triangular part of $R$: for $i > j$ we have $R_{ij} = (u_i, v_j) = 0$, since $v_j$ is a linear combination of $u_1, \dots, u_j$, all of which are orthogonal to $u_i$.
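These identities are easy to check numerically with the built-in QR decomposition (a quick sanity check, assuming `import torch`):

A = torch.rand(6, 4, dtype=torch.float64)
Q, R = torch.linalg.qr(A)                                          # reduced QR: Q is 6x4, R is 4x4
print(torch.allclose(A, Q @ R))                                    # A = QR
print(torch.allclose(R, Q.T @ A))                                  # R = Q^T A
print(torch.allclose(Q.T @ Q, torch.eye(4, dtype=torch.float64)))  # columns of Q are orthonormal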
There is a more computationally stable version of the Gram-Schmidt algorithm that reduces the loss of orthogonality in matrix $Q$. This algorithm is called the modified Gram-Schmidt algorithm.
The orthogonalization loop is the same, except that instead of computing the vector $u_k$ in a single step,
$$ u_k = v_k - \frac{(v_k, u_1)}{(u_1, u_1)} u_1 - \frac{(v_k, u_2)}{(u_2, u_2)} u_2 - \dots - \frac{(v_k, u_{k-1})}{(u_{k-1}, u_{k-1})} u_{k-1}, $$one iteratively updates $u_k$ step by step for better numerical stability:
$$ \begin{split} u_k &= v_k - \frac{(v_k, u_1)}{(u_1, u_1)} u_1, \\ u_k &= u_k - \frac{(u_k, u_2)}{(u_2, u_2)} u_2, \\ \dots \\ u_k &= u_k - \frac{(u_k, u_{k-1})}{(u_{k-1}, u_{k-1})} u_{k-1}. \\ \end{split} $$You are supposed to do the following tasks:
1. (2 points) Implement the modified Gram-Schmidt algorithm in a function `modified_gram_schmidt(A)` using PyTorch.
2. (1 point) Create a random matrix $A \in \mathbb{R}^{m \times n}$, $m \gg n$. For an arbitrary loss function, e.g. $L(Q) = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} q_{ij}$, compute the gradient $\frac{\partial L}{\partial Q}$. Finally, compute $\frac{\partial L}{\partial A}$ using built-in PyTorch backpropagation through your implementation of the QR decomposition (`modified_gram_schmidt(A)`).
3. (2 points) Compute the gradient $\frac{\partial L}{\partial A}$ using backpropagation through the built-in PyTorch function for computing the QR decomposition.
# install dependencies
!pip install memory-profiler
import torch
def modified_gram_schmidt(A):
    """
    Computes the QR decomposition of a matrix A
    Input:
        A - m x n matrix
    Output:
        Q - m x n matrix with orthonormal columns
        R - n x n upper triangular matrix
    """
    # Your code here
    return Q, R
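Once `modified_gram_schmidt` is implemented, a quick sanity check might look like this (illustrative only):

A_test = torch.rand(5, 3)
Q_test, R_test = modified_gram_schmidt(A_test)
print(torch.dist(A_test, Q_test @ R_test))           # should be close to 0
print(torch.dist(Q_test.T @ Q_test, torch.eye(3)))   # columns of Q should be orthonormal
print(torch.allclose(R_test, torch.triu(R_test)))    # R should be upper triangular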
m = 100
n = 20
A = torch.rand((m, n))
# Your code is below
Q1, R1 = ...  # built-in torch version
Q2, R2 = ...  # modified_gram_schmidt
loss1 = Q1.sum()
loss2 = Q2.sum()
# TODO compute gradients
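For reference, backpropagation to $\frac{\partial L}{\partial A}$ in PyTorch generally follows this pattern (a minimal sketch using the built-in `torch.linalg.qr`; the same idea applies to your `modified_gram_schmidt`):

A_ref = torch.rand((m, n), requires_grad=True)  # leaf tensor, so .grad will be populated
Q_ref, R_ref = torch.linalg.qr(A_ref)           # built-in (reduced) QR decomposition
loss = Q_ref.sum()                              # example loss L(Q)
loss.backward()                                 # fills A_ref.grad with dL/dA
print(A_ref.grad.shape)                         # torch.Size([100, 20])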
4. (13 points total) Write the backpropagation $\frac{\partial L}{\partial A}$ from scratch with PyTorch. Do not forget to use `requires_grad=False` to disable PyTorch autograd accumulation of the gradient.
a) (2 points) Derive analytically the gradient $\frac{\partial y}{\partial x}$ of the vector normalization operation $y := x/\|x\|$.
b) (1 point) Implement the backward pass through the vector normalization operation (can be implemented either in the function `backward_normalization` or further inside the code).
c) (2 points) Derive analytically the gradient $\frac{\partial u_k}{\partial v_j}$ of the orthogonalization operation:
$$ \begin{split} u_k &= v_k - \frac{(v_k, u_1)}{(u_1, u_1)} u_1, \\ u_k &= u_k - \frac{(u_k, u_2)}{(u_2, u_2)} u_2, \\ \dots \\ u_k &= u_k - \frac{(u_k, u_{k-1})}{(u_{k-1}, u_{k-1})} u_{k-1}. \\ \end{split} $$d) (2 points) Implement the backward pass through the orthogonalization operation (can be implemented either in the function `backward_orthogonalization` or further inside the code).
e) (1 point) Implement the QR forward pass (almost identical to `modified_gram_schmidt`, but storing the additional data necessary for backpropagation) in the method `forward` of the class `QR` (see the template below).
f) (5 points) Implement the QR backward pass using the previously implemented functions and place it in the method `backward` of the class `QR` (see the template below).
5. (2 points) Look at the obtained $Q$ and $R$ matrices for each approach (steps 2, 3, 4) and verify that $A \approx QR$ and $Q^TQ \approx I$. Measure the memory required for backpropagation (e.g. with `memory_profiler`).
Compare the results of the three approaches:
- autograd through your `modified_gram_schmidt` implementation (step 2),
- autograd through the built-in PyTorch QR (step 3),
- the manual backward pass in the `QR` class (step 4).
Compare the theoretical asymptotics of the memory usage for the backpropagation of the modified Gram-Schmidt algorithm with every approach listed above. What data is needed for backpropagation and thus affects the memory consumption?
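In a notebook, the `memory_profiler` extension can be used roughly like this (a sketch; the profiled helper function is illustrative):

%load_ext memory_profiler

def qr_forward_backward(A):
    # one full forward + backward pass through your implementation
    Q, R = modified_gram_schmidt(A)
    Q.sum().backward()

%memit qr_forward_backward(A.detach().clone().requires_grad_(True))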
def backward_normalization(...):
    # Your code

def backward_orthogonalization(...):
    # Your code

class QR():
    def forward(self, A):
        """
        Computes the QR decomposition of a matrix A
        Input:
            A - m x n matrix
        Output:
            Q - m x n matrix with orthonormal columns
            R - n x n upper triangular matrix
        """
        # Your code
        return Q, R

    def backward(self, grad_output):
        """
        Backward pass of the QR decomposition
        Input:
            grad_output - m x n matrix, gradient of the loss with respect to the output (dL/dQ in our case)
        Output:
            grad_input - m x n gradient dL/dA
        """
        # Your code
        return grad_input
qr = QR()
A = torch.rand((m, n))
Q, R = qr.forward(A)
loss1 = Q.sum()
# TODO compute losses and do backpropagation
# compute dL / dQ and finally dL / dA
...
dL_dA = ...
A_autograd = A.clone().detach().requires_grad_(True)
Q_autograd, R_autograd = qr.forward(A_autograd)
loss2 = Q_autograd.sum()
loss2.backward()
print('Difference between gradients:', float(torch.linalg.norm(dL_dA - A_autograd.grad)))
6. (10 points) Modify the previous code to be memory-efficient.
Hint: not all intermediate variables need to be stored for backpropagation.
7. (3 points) Again, measure the required memory for backpropagation. Explain the theoretical asymptotics of the memory usage for the memory-efficient backpropagation.
# Your solution is here
In this assignment you will have a chance to develop a voice anti-spoofing algorithm, enhancing your skills in the standard PyTorch classification development pipeline.
There are several types of attack:
In this work we will focus on deepfake attacks and try to distinguish them from genuine (or bona-fide) speech.
We will use data provided by the organisers of the biennial ASVspoof challenge. More precisely, we will take a subpart of the `ASVspoof_2019LA` dataset, which consists of 17 different attack types and bona-fide speech. All recordings were augmented to be close to real telephone recordings. We will not take the 2021 data, but if you want, you can use the 2021 data as a test set and the 2019 data for training and validation.
The initial dataset can be downloaded from here. However, we provide you with a smaller subsample of the dev (val) and eval (test) parts to make working in Colab feasible :) . You can download it locally or put it into your Drive. You will need around 3 GB of free space; think about creating a new account if you don't have enough.
We encourage you to look through the data overview and the result summary paper.
One important thing to note: the main application of these algorithms is voice biometrics, where we want to stop an illegal intruder. Thus, there are two (or three) common ways of developing anti-spoofing algorithms: a speaker-aware setup, where we either train a verification model that is sensitive to spoofing and separates bona-fide speakers from impostors, or combine the verification score with the score of an anti-spoofing model to decide whether the person is the same; and, finally, the simple and general countermeasure setup, where the model is given an audio clip and should predict whether it is spoofed or bona-fide. We will work in the last setup.
Important: feel free not to use the pre-defined functions; you can solve the task as you wish or change the functions/pipelines significantly.
It is good practice to log and check your experiments. In this assignment, we ask you to sign up for wandb, log your experiments, give us a link to your project, and attach plots/a report in the notebook with your solution.
!pip install wandb
import copy
import os
import sys
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import IPython.display as ipd
from typing import Optional
import torch
import torch.nn as nn
import torchaudio
# import librosa
import sklearn.metrics as metrics
import wandb
from torchvision.models import resnet50, resnet18
from torch.utils.data import Dataset, DataLoader
from torch.nn.functional import sigmoid
from scipy.optimize import brentq
from scipy.interpolate import interp1d
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score, roc_curve
# https://wandb.ai/quickstart
!wandb login # and paste your API key from https://wandb.ai/authorize
wandb.init(project="voice-anti-spoofing", name = "test_run", tags = ["SmallModel", "LA"])
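Once a run is initialized, metrics are logged with `wandb.log`; the metric names below are just an example:

wandb.log({"train/loss": 0.69, "train/acc": 0.55, "epoch": 0})  # typically called inside the training loop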
SEED = 42 # do not change it
# mount drive if you wish
from google.colab import drive
drive.mount('/content/drive/')
%cd /content/drive/MyDrive/
!ls for_colab
# to make dataframes
path_la_train = "for_colab/ASVspoof2019_LA_cm_protocols/ASVspoof2019.LA.cm.train.trn.txt"
path_la_dev = "for_colab/ASVspoof2019_LA_cm_protocols/ASVspoof2019.LA.cm.dev.trl.txt"
path_la_eval = "for_colab/ASVspoof2019_LA_cm_protocols/ASVspoof2019.LA.cm.eval.trl.txt"
flac_la_train = "for_colab/train_flac/"
flac_la_dev = "for_colab/flac_dev/"
flac_la_eval = "for_colab/flac_eval/"
columns = ["speaker_id", "audio_file_name", "system_id", "skip", "class"]
df = pd.read_csv(path_la_train, sep=" ", header=None)
df.columns = columns
df.head()
def makedf_ultimate(txt_file_path, flac_path, dataset_name='ASVspoof', data_type='LA', year=2019):
    if dataset_name == 'ASVspoof':
        name = dataset_name + str(year) + data_type
    if data_type == 'DF':
        attack_name = 'deepfake'
    elif data_type == 'LA':
        attack_name = 'deepfake'
    else:
        attack_name = 'replay'
    df = pd.read_csv(txt_file_path, sep=" ", header=None)
    if year == 2019:
        df.columns = ["speaker_id", "audio_file_name", "system_id", "skip", "class"]
    df['dataset'] = name
    di = {'bonafide': 0, 'spoof': 1}  # assign classes
    df['class'] = df['class'].map(di)
    # m = df['class'] == 0
    # df.loc[m, 'attack'] = 'noattack'
    df['audio_path'] = flac_path + df['audio_file_name'] + '.flac'
    # df = df[['audio_path', 'class', 'dataset', 'attack']]
    df = df[["speaker_id", "audio_path", "class"]]
    df = df.reset_index(drop=True)
    return df
df_train = makedf_ultimate(path_la_train, flac_la_train, year=2019, data_type='LA')
df_dev = makedf_ultimate(path_la_dev, flac_la_dev, year=2019, data_type='LA')
df_eval = makedf_ultimate(path_la_eval, flac_la_eval, year=2019, data_type='LA')
Task 1.1 (2 points) Check the data distribution. Is it imbalanced? How are you going to deal with it?
Hints: upsampling, downsampling, adding new data, e.g. from this source (but will the new audios have the same distribution?), or adjusting the loss function.
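A quick way to inspect the class balance of the dataframes built above (a sketch):

print(df_train['class'].value_counts(normalize=True))  # fraction of bona-fide (0) vs spoof (1)
print(df_dev['class'].value_counts())
print(df_eval['class'].value_counts())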
Answer
...
# for dev and eval, let's simplify the task and use a smaller amount of data, but balanced, to better understand the metrics
# we need this function because it removes unuploaded data from our dataframes
def balanced_split(df):
    k = min(len(df[df['class'] == 0]), len(df[df['class'] == 1]))
    print(k)
    df_bf = df[df['class'] == 0].sample(frac=1, random_state=SEED).iloc[0:k]
    df_spoof = df[df['class'] == 1].sample(frac=1, random_state=SEED).iloc[0:k]
    df_bf = pd.concat([df_bf, df_spoof], ignore_index=True)  # DataFrame.append was removed in pandas 2.x
    df_bf = df_bf.reset_index(drop=True)
    return df_bf
df_eval = balanced_split(df_eval)
df_dev = balanced_split(df_dev)
# for audio manipulations we advise you to use torchaudio or librosa
x, sr = torchaudio.load(df_eval["audio_path"].iloc[0])
ipd.display(ipd.Audio(x, rate=sr))
Task 1.2. (1 point) Display several spoof and bona-fide audios. Can you hear the difference?
### Your code is here
You can find inspiration and hints in
Task 2.1 (5 points) We can work with audio as with images by transforming it into spectrograms. Here your task is to implement a simple model, which receives a raw wav (amplitudes, possibly already preprocessed in the dataset), transforms it into a mel-spectrogram, changes the amplitude to the DB scale (if you want) and then processes it through the layers. You are free to use ready pre-trained backbones, e.g. from `torchvision.models`, and fine-tune them. However, if you want to practice more, create something on your own. Large models with `requires_grad=True` for all parameters will probably cause problems when training in Colab.
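The torchaudio transforms typically used for this step look roughly as follows (a sketch; the parameter values are just an example):

import torchaudio.transforms as T

melspec = T.MelSpectrogram(sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)
to_db = T.AmplitudeToDB()
mels_db = to_db(melspec(x))  # x is a (channels, samples) waveform, e.g. the one loaded above
print(mels_db.shape)         # (channels, n_mels, time_frames)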
class WavResNet(nn.Module):
    def __init__(self, classes=None, resample=16000, n_mels=80, melspec_config=None):
        super().__init__()
        self.fbank = ...
        self.to_db = ...
        model = ...
        model.conv1 = ...
        num_ftrs = ...
        model.fc = nn.Linear(num_ftrs, classes)
        self.network = model
        # or create your own layers and use them in the forward pass

    def forward(self, x, wav_lens: Optional[torch.Tensor] = None):  # you can check the lengths if you want; this is helpful for inference
        mels = ...
        mels_db = ...
        mels_db = ...
        out = ...
        return out
Task 2.2 (1 point) In your opinion, which approach is better for binary classification:
Task 2.3 (3 points) Create a custom dataset, which receives a `df` and returns preprocessed audio.
Task 2.4 (1 point) Should we use augmentations? If yes, which ones?
class DatasetWav(Dataset):
    def __init__(self, data_frame, padding_sec=4, default_sr=16000, transform=None):
        self.df = data_frame
        self.padding_sec = padding_sec
        self.default_sr = default_sr
        self.labels = ...
        self.paths = ...
        self.vad = ...  # in this task you are free to omit it in order to speed up calculations;
        # also the provided data should be rather clean

    def __len__(self):
        return ...

    def __getitem__(self, idx):
        label = ...
        path = ...
        waveform, sr = ...  # normalize=True
        # transform the waveform from stereo to mono channel
        waveform = ...
        resample_transform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=self.default_sr)  # should we use it for our data, or can we omit it?
        waveform = ...
        # waveform = self.vad(waveform)
        # come up with an idea of what to do if the audio is longer or shorter than required
        if (len(waveform) < self.padding_sec * self.default_sr):
            pass
        else:
            pass
        return x, label
# check that works
batch_size = None
train_dataset_wavs = DatasetWav(df_train) #.iloc[0:200]
train_dataloader_wavs = DataLoader(train_dataset_wavs, batch_size=batch_size, shuffle=True, num_workers=8)
x, y = next(iter(train_dataloader_wavs))
model = ...
model(x)
Task 3.0 (1 point) What is the difference between `model.train()` and `model.eval()`? Does `model.eval()` mode take gradient statistics into account?
Task 3.1 (5 points) Implement train and test functions, which iterate over all batches. Log the loss and accuracy on each batch and after every epoch. Compute the equal error rate (EER) and explain what it is.
def calculate_eer(y, y_score):
    fpr, tpr, thresholds = roc_curve(y, y_score, pos_label=1)
    eer = brentq(lambda x: 1. - x - interp1d(fpr, tpr)(x), 0., 1.)
    thresh = interp1d(fpr, thresholds)(eer)
    return eer, thresh
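EER is the operating point on the ROC curve where the false positive rate equals the false negative rate ($1 - \text{TPR}$). A simple discrete approximation on toy scores (purely illustrative numbers):

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.7, 0.2, 0.3])
fpr, tpr, _ = roc_curve(y_true, y_score, pos_label=1)
fnr = 1 - tpr
idx = np.nanargmin(np.abs(fnr - fpr))        # point where FPR and FNR are closest
print("EER ~", (fpr[idx] + fnr[idx]) / 2)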
def train(model, dataloaders, criterion, optimizer,
          num_epochs=3, scheduler=None, savename=None,
          print_counter=10, decay_factor=10,
          device=torch.device("cuda")):
    pass

def test(model, test_dataloader, criterion=None, device=None, savename=None):
    pass
Task 4.1 (3 points) What loss will you choose and why? What should it receive as input (logits or probabilities)? Explain your answer. Consider also:
Task 4.2 (5 points) Implement the main function, which receives a yaml or json config (or a path to it), trains and tests the model, saves model checkpoints, and saves the model's test predictions. At the very least, you should obtain results better than random. It is good if the accuracy on the test dataset is >= 0.75.
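An illustrative config might look like this (the field names are an assumption; adjust them to your own pipeline):

config = {
    "learning_rate": 1e-4,
    "epochs": 1,
    "batch_size": 32,
    "n_mels": 80,
    "checkpoint_dir": "./checkpoints",
}
# or store the same fields in a yaml/json file and pass its path to main()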
def main(config):
    lr = ...
    epochs = ...
    batch_size = ...
    # wandb.init(project="voice-anti-spoofing", name="test_run", tags=["ResNet", "LA"])
    wandb.config = {
        "learning_rate": lr,
        "epochs": epochs,
        "batch_size": batch_size
    }
    # debug first on a small subpart of the dataset
    train_dataset_wavs = DatasetWav()  # .iloc[0:200]
    train_dataloader_wavs = DataLoader()
    val_dataset_wavs = ...
    val_dataloader_wavs = ...
    test_dataset_wavs = ...
    test_dataloader_wavs = ...
    dataloaders = {"train": train_dataloader_wavs, "val": val_dataloader_wavs}

    device = ...
    classes = ...
    model = ...
    model = model.to(device)
    # freeze layers except the last one if you want
    for param in model.parameters():
        param.requires_grad = ...
    model.network.fc.requires_grad_(...)

    optimizer = ...
    scheduler = ...
    alpha = ...  # class weight for the loss
    sc = torch.tensor([alpha, 1 - alpha]).to(device)  # weights for the loss
    criterion = nn.CrossEntropyLoss(weight=sc)

    for_ckpts = "."
    os.makedirs(for_ckpts, exist_ok=True)
    model, val_acc_history = train(...)
    y, probs = test(...)
    return y, probs
y, probs = main(...)
Training might require a long time, so you can try to use a subpart of the dataset to get reasonably good results. You also don't have to train for many epochs; 1 or even half an epoch might be enough. Also consider making some speed-ups in the model and the dataset.
Colab might also work unstably with lots of data. If this happens, think about how to deal with it, e.g. restart the kernel and/or remove the majority of the data from Drive, fix the bugs first, and then run training and testing separately.
Task 4.3 (5 points) Analyze the results. Play some audios with wrong predictions. Plot the prediction distributions per class. Is it possible to find a better threshold?
Task 4.4 (5 points) Discuss the results, your approaches, and the hyperparameters. If you can, run some experiments with different parameters.
Write down your ideas about what else you could try to improve the results in the future.
Task 4.5 (3 points) Provide a link to the wandb project or a link to the report. Upload the best model's weights to Drive and provide a link with access rights.
# Your solution is here
Bonus task. (50 points)
So far, we didn't care about speaker_id and verification. Take a verification model from wherever you want and create a countermeasure subnetwork, as described in this article. Train the model as described there. ASVspoof also provides enrollment (anchor) recordings. Try blending to obtain better countermeasure metrics.
These links might be helpful:
For FCNN (Fully Connected Neural Networks) \begin{equation} f(x)=f_k\left(f_{k-1}\left(\ldots\left(f_0(x)\right)\ldots\right)\right), \end{equation} assume the nonlinearity function $\sigma$ is ReLU and the loss function $L$ is the squared error loss. Prove that the deep learning model is not unique for any dataset, i.e. the optimization problem \begin{equation} \min_\theta \ g(\theta)=\frac{1}{N} \sum_{i=1}^N L\left(y_i, \hat{y}_i\right), \quad \text{where} \ \ \hat{y}_i=f\left(x_i, \theta\right), \end{equation} does not have a unique global minimizer $\theta$.
The softmax activation function is \begin{equation} \mathrm{Softmax}(\mathbf{z})_i=\frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \ \text{ for } \ i=1, \ldots, K \ \text{ and } \ \mathbf{z}=\left(z_1, \ldots, z_K\right) \in \mathbb{R}^K. \end{equation} Suppose $\mathbf{z} \in \mathbb{R}^K$ and $\mathbf{a} \in \mathbb{R}^K$. Prove that for the optimization problem \begin{equation} \begin{aligned} & \min_{\mathbf{a}} \ -\langle\mathbf{a}, \mathbf{z}\rangle+\langle\mathbf{a}, \log \mathbf{a}\rangle \\ & \text{ s.t. } \sum_{k=1}^K \mathbf{a}_{k}=1, \end{aligned} \end{equation} the minimizer is \begin{equation} \mathbf{a}^*=\mathrm{Softmax}(\mathbf{z}). \end{equation}
Hint: the KKT optimality conditions and the convexity of the problem can help you show the required equality.
For function $f(x)=x^2, x \in \left[0, 1\right]$,
where $n$ is the number of layers in the neural network.
Hint: For each $ x \in \left[0, 1\right]$, $x$ can be denoted by its binary expansion $x=\sum_{i=0}^{∞} x_i/2^i$, where $x_i \in \{ 0, 1\}$. The above structure can be used to find $x_0,\dots, x_n$. Then we can write $\hat f(x)=f\left(\sum_{i=0}^{∞} x_i/2^i\right).$
After the proof, if we want to achieve an $\epsilon$ approximation error with the above neural network, the number of layers $n$ has to satisfy the condition $\frac{1}{2^{n-1}}\leq \epsilon$, i.e. $n\geq \log_2 \frac{1}{\epsilon} + 1$.
# Your solution is here