Using sort and uniq


Next-generation sequencing data can be quickly summarized with the commands sort and uniq:

  • Examine the file shortseq.txt:
head -n5 shortseq.txt
wc -l shortseq.txt
  • Each line contains a short nucleotide sequence, and the file has 1000 lines.
  • Sort the sequences and view the first 10:
sort shortseq.txt | head -n10
  • There are multiple representatives of each sequence.
  • Create a non-redundant list using uniq (the input must be sorted first, because uniq only collapses adjacent identical lines):
sort shortseq.txt | uniq | wc -l
  • Add a count for the number of times each one occurs:
sort shortseq.txt | uniq -c | head -n20
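
The steps above can be tried on a small input where the answer is easy to check by eye. The following is a minimal sketch, not part of the original exercise: the file demo_seqs.txt and its six sequences are made up for illustration, and mktemp -d is assumed to be available. It shows why sort has to come before uniq, and adds one optional extra step (sort -rn) to rank sequences by how often they occur.

```shell
# Hypothetical toy input (not the real shortseq.txt): 6 reads, 3 distinct sequences.
tmpdir=$(mktemp -d)
printf '%s\n' ACGT TTGA ACGT GGCC ACGT TTGA > "$tmpdir/demo_seqs.txt"

# uniq only merges adjacent duplicates, so unsorted input is not collapsed.
unsorted_distinct=$(uniq "$tmpdir/demo_seqs.txt" | wc -l | tr -d ' ')

# Sorting first places identical sequences next to each other.
sorted_distinct=$(sort "$tmpdir/demo_seqs.txt" | uniq | wc -l | tr -d ' ')

# Count each sequence, rank by count (highest first), and keep the top sequence.
top_sequence=$(sort "$tmpdir/demo_seqs.txt" | uniq -c | sort -rn | head -n1 | awk '{print $2}')

echo "distinct without sort: $unsorted_distinct"
echo "distinct with sort:    $sorted_distinct"
echo "most frequent:         $top_sequence"

rm -rf "$tmpdir"
```

On the real data, the same ranking would be sort shortseq.txt | uniq -c | sort -rn | head -n20.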

  • Ask specific questions about the output, such as returning all sequences that occur exactly 4 times:
sort shortseq.txt | uniq -c | awk '$1 == 4'
  • Generate the data for a frequency histogram (how many distinct sequences occur once, twice, and so on):
sort shortseq.txt | uniq -c | awk '{print $1}' | sort -n | uniq -c
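
The histogram pipeline can be hard to read at first because uniq -c is applied twice. Here is a small sketch on made-up data (demo_seqs.txt and its contents are hypothetical, and mktemp -d is assumed to be available). It sorts the counts with sort -n so the bins are listed in numeric order, swaps the columns so the occurrence count comes first, and checks that the bins account for every input line.

```shell
# Hypothetical toy input: ACGT x3, TTGA x2, GGCC x1, CCAA x1 (7 reads in total).
tmpdir=$(mktemp -d)
printf '%s\n' ACGT TTGA ACGT GGCC ACGT TTGA CCAA > "$tmpdir/demo_seqs.txt"

# Column 1: times a sequence occurs.
# Column 2: how many distinct sequences occur that many times.
histogram=$(sort "$tmpdir/demo_seqs.txt" | uniq -c | awk '{print $1}' \
  | sort -n | uniq -c | awk '{print $2 "\t" $1}')
printf 'occurrences\tsequences\n%s\n' "$histogram"

# Sanity check: the sum of occurrences * sequences should equal the number of input lines.
total_reads=$(printf '%s\n' "$histogram" | awk '{sum += $1 * $2} END {print sum}')
echo "reads accounted for: $total_reads"

rm -rf "$tmpdir"
```

The two-column output can be pasted into a spreadsheet or plotting tool to draw the histogram itself.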