It’s quite common to find interviews, podcasts, or other audio tracks full of noise.
The noise may come from a bad mic, poor mic isolation, or an outdoor recording with wind, rain, etc.

The human voice sits mostly between roughly 300 Hz and 3000 Hz.
There are a few ways to isolate that range. The easiest is to apply a lowpass and a highpass filter, which cut out most of the noise outside that band and make voices stand out.

To do so we can use ffmpeg. First, preview the result with ffplay:

ffplay INPUT -af lowpass=3000,highpass=200

and adjust the two cutoffs (lowpass at 3000 Hz, highpass at 200 Hz) as you wish.
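If you try several cutoff pairs, it can help to build the filter string with a small shell function instead of editing the command by hand. This is a minimal sketch: `voice_filter` is a hypothetical helper name, and its defaults are simply the 200 Hz / 3000 Hz values used above.

```shell
#!/bin/sh
# voice_filter [LOW] [HIGH]  (hypothetical helper, not part of ffmpeg)
# Prints an ffmpeg audio filter chain that keeps roughly LOW..HIGH Hz:
# lowpass removes content above HIGH, highpass removes content below LOW.
voice_filter() {
    low=${1:-200}    # highpass cutoff in Hz (default from the example above)
    high=${2:-3000}  # lowpass cutoff in Hz (default from the example above)

    # Both cutoffs must be plain non-negative integers.
    case "$low$high" in
        *[!0-9]*)
            echo "voice_filter: cutoffs must be integers" >&2
            return 1 ;;
    esac

    # An empty or inverted band would remove everything.
    if [ "$low" -ge "$high" ]; then
        echo "voice_filter: LOW must be below HIGH" >&2
        return 1
    fi

    # Same order as the commands in this post: lowpass first, then highpass.
    printf 'lowpass=%s,highpass=%s\n' "$high" "$low"
}
```

For example, `ffplay INPUT -af "$(voice_filter 300 3400)"` previews a slightly wider band, and `voice_filter` with no arguments reproduces the original `lowpass=3000,highpass=200`.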
With youtube-dl you can also preview your filters without downloading the track:

ffplay "$(youtube-dl -f bestaudio -g VIDEO-URL)" -af lowpass=3000,highpass=200

Then apply the desired modifications to a new file with ffmpeg:

ffmpeg -i INPUT -af lowpass=3000,highpass=200 OUTPUT
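To process several recordings at once, the same command can go in a loop. This is a minimal POSIX sh sketch, assuming ffmpeg is on your PATH; `filtered_name` and `filter_all` are hypothetical helper names, and each output is written next to its input as NAME.filtered.EXT so ffmpeg can still pick the output format from the extension.

```shell
#!/bin/sh
# Filter chain from the commands above; adjust the cutoffs as needed.
FILTER='lowpass=3000,highpass=200'

# filtered_name talk.mp3 -> talk.filtered.mp3 (hypothetical helper)
# Only the file's own name is checked for an extension, so dots in
# directory names are ignored.
filtered_name() {
    case "${1##*/}" in
        *.*) printf '%s.filtered.%s\n' "${1%.*}" "${1##*.}" ;;
        *)
            # ffmpeg picks the output format from the extension.
            echo "filtered_name: '$1' has no extension" >&2
            return 1 ;;
    esac
}

# filter_all FILE...  (hypothetical helper)
# Runs the filter on every file; stops at the first failure.
filter_all() {
    for f in "$@"; do
        out=$(filtered_name "$f") || return 1
        # -n: never overwrite an existing output (ffmpeg exits with an error instead).
        ffmpeg -hide_banner -n -i "$f" -af "$FILTER" "$out" || return 1
    done
}
```

For example, `filter_all *.mp3` writes `interview.filtered.mp3` for `interview.mp3`, and so on. Using `-n` means rerunning the loop fails on already-processed files instead of silently overwriting them.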