Using sort and uniq
Next-generation sequence data can be quickly summarized with the commands sort and uniq:
- Examine the file shortseq.txt:
head -n5 shortseq.txt
wc -l shortseq.txt
- Each line contains a short nucleotide sequence, and there are 1000 lines.
- Sort the sequences and view the first 10:
sort shortseq.txt | head -n10
- There are multiple representatives of each sequence.
- Create a non-redundant list using uniq and count its entries (uniq only collapses adjacent duplicate lines, so sort first):
sort shortseq.txt | uniq | wc -l
- Add a count of the number of times each sequence occurs:
sort shortseq.txt | uniq -c | head -n20
- Ask specific questions about the output, such as returning all sequences that occur 4 times:
sort shortseq.txt | uniq -c | awk '$1 == 4'
- Generate the data for a frequency histogram:
sort shortseq.txt | uniq -c | awk '{print $1}' | sort -n | uniq -c
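The steps above can be tried on a file small enough to check by eye. The sketch below is only an illustration: the seven sequences are made up, the temporary file stands in for shortseq.txt, and the variable names are hypothetical. It shows why sort has to come before uniq, and what the histogram columns mean.

```shell
# Hypothetical stand-in for shortseq.txt: seven made-up reads.
# ACGT occurs 3 times, TTGA twice, GGCC and CCAA once each.
tmp=$(mktemp)
printf '%s\n' ACGT TTGA ACGT GGCC ACGT TTGA CCAA > "$tmp"

# uniq alone only merges adjacent duplicates, so nothing is collapsed here.
# (tr strips the padding that some wc implementations add.)
unsorted=$(uniq "$tmp" | wc -l | tr -d ' ')

# Sorting first places identical sequences next to each other.
distinct=$(sort "$tmp" | uniq | wc -l | tr -d ' ')

# Sequences that occur exactly twice (column 1 of uniq -c is the count).
twice=$(sort "$tmp" | uniq -c | awk '$1 == 2 {print $2}')

# Histogram: for each occurrence count, how many sequences have it.
# sort -n orders the counts numerically; the final awk swaps the columns
# so each line reads "<occurrences> <number of sequences>".
histogram=$(sort "$tmp" | uniq -c | awk '{print $1}' | sort -n | uniq -c | awk '{print $2, $1}')

echo "lines after uniq alone: $unsorted"
echo "distinct sequences: $distinct"
echo "sequences seen twice: $twice"
echo "$histogram"

rm -f "$tmp"
```

In the histogram, the line "1 2" means that two sequences occur once, "2 1" that one sequence occurs twice, and "3 1" that one sequence occurs three times.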