Top tips to keep your home folder on a server tidy
Anyone who uses or administers a bioinformatics server knows how easily a home directory fills up with huge amounts of data, especially when you are analysing deep sequencing data on a daily basis.
Here is a list of a few useful commands and tips that can help to keep your home directory tidy.
- Do not copy fastq files to multiple locations. Create soft links in your working directories instead, using the following command.
ln -s /source/file/location/file1.fastq /destination/location/file1.fastq
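Soft links break silently when the original file is moved or deleted, so it helps to check for dangling links now and then. A minimal sketch, assuming a POSIX shell and find; the function name list_broken_links is hypothetical.

```shell
# list_broken_links is a hypothetical helper name, not a standard command.
# Prints every symbolic link under the given directory (default: current
# directory) whose target no longer exists.
list_broken_links() {
    # "test -e" follows the link, so it fails when the target is missing
    find "${1:-.}" -type l ! -exec test -e {} \; -print
}
```

For example, list_broken_links /destination/location would print any links there that no longer point at a file.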
- Always convert .sam files to .bam files
samtools view -b file.sam > file.bam
- Compress any data files/folders that are not going to be used for the next few weeks.
tar -cf files.tar file1 file2 file3
bzip2 files.tar
- To compress a directory and all the data within it, run the following command.
tar -zcvf archive.tar.gz directory
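Before deleting the original directory, it is worth checking that the archive can be read back. A minimal sketch of that check; archive_and_verify is a hypothetical name and the default archive name is an assumption.

```shell
# archive_and_verify is a hypothetical helper, not part of tar.
# Compresses a directory, then lists the archive to confirm it is readable.
archive_and_verify() {
    dir="${1%/}"                    # drop a trailing slash, if any
    archive="${2:-$dir.tar.gz}"     # assumed default: <directory>.tar.gz
    tar -zcf "$archive" "$dir" &&
        tar -tzf "$archive" > /dev/null &&
        echo "OK: $archive"
}
```

Only remove the directory once the function has printed its OK line.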
- Organise your home directory well.
Keep all reference sequences in one folder
Keep all indexes in one folder (this could be the same folder as the references for simplicity)
- Always delete temporary and intermediate files and keep a log of deleted files in a text file.
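One way to keep such a log is to wrap rm in a small shell function. A minimal sketch, assuming a POSIX shell; the name rm_logged and the log location in DELETED_LOG are both hypothetical.

```shell
# rm_logged is a hypothetical helper; the log location is an assumption.
DELETED_LOG="${DELETED_LOG:-$HOME/deleted_files.log}"

rm_logged() {
    for f in "$@"; do
        # Record an absolute path so the log stays meaningful later
        case "$f" in
            /*) path="$f" ;;
            *)  path="$PWD/$f" ;;
        esac
        # Append a timestamped entry only if the deletion succeeded
        rm -r -- "$f" &&
            printf '%s\t%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$path" >> "$DELETED_LOG"
    done
}
```

For example, rm_logged sample1.tmp.sam intermediate_dir deletes both and appends one line per path to the log.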
- Empty the trash folder if you use a GUI or Virtual Desktop Environment
- Use ncdu, tree or baobab (a GUI tool) to find out what is consuming disk space.
ncdu /home/folder
tree -sh /home/folder
- Find out the size of your home directory using the following du command
du -sch /home/folder
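The total alone does not show where the space went. A minimal sketch that ranks the entries directly under a directory by size, assuming a du that supports -d (GNU and BSD versions do); largest_entries is a hypothetical name.

```shell
# largest_entries is a hypothetical helper name.
# Usage: largest_entries [directory] [how_many]
largest_entries() {
    dir="${1:-$HOME}"
    n="${2:-10}"
    # -k reports sizes in kilobytes so that a plain numeric sort works;
    # -d 1 limits the report to the directory and its immediate subdirectories
    du -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$n"
}
```

The output normally includes the total for the directory itself, followed by its largest subdirectories.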
- For advanced users:
As mentioned in this Stack Overflow thread, if you would like a list of the duplicate copies of files in your directory, use the following set of commands.
find /home/folder -type f -exec md5sum {} \; > md5sums
gawk '{print $1}' md5sums | sort | uniq -d > dupes
while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes > dupes_list
Brilliant!
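The duplicate search writes intermediate files (md5sums, dupes) into the current directory, which is a little at odds with keeping things tidy. A possible single-pipeline variant is sketched here, assuming GNU coreutils (md5sum, and uniq with the -w and --all-repeated options); list_duplicates is a hypothetical name.

```shell
# list_duplicates is a hypothetical helper; it assumes GNU coreutils.
# Prints "hash  path" for every file that shares its MD5 hash with at
# least one other file, with a blank line between groups of duplicates.
# uniq -w 32 compares only the first 32 characters (the MD5 hash).
list_duplicates() {
    find "${1:-.}" -type f -exec md5sum {} + |
        sort |
        uniq -w 32 --all-repeated=separate
}
```

For example, list_duplicates /home/folder > dupes_list leaves a single output file and no temporary ones.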