This kind of problem is usually solved using machine learning techniques.
Break the signal down into a sequence of 20 ms or 50 ms frames and extract features from each frame. MFCCs generally work well for this kind of application, though there are features more specific to voice detection, such as the 4 Hz modulation energy (roughly the rate at which people speak) and the zero-crossing rate.
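For instance, here is a minimal feature-extraction sketch in Python using librosa (the file name, sample rate, and frame sizes are assumptions, not requirements):

```python
import numpy as np
import librosa

# Load the audio (hypothetical file) and resample to 16 kHz.
y, sr = librosa.load("recording.wav", sr=16000)

frame_length = int(0.025 * sr)  # 25 ms frames
hop_length = int(0.010 * sr)    # 10 ms hop between frames

# 13 MFCCs per frame, plus the zero-crossing rate as an extra
# voice-oriented feature.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=frame_length, hop_length=hop_length)
zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                         hop_length=hop_length)

features = np.vstack([mfcc, zcr]).T  # shape: (n_frames, 14)
```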
Then, using a training set of audio you have manually labeled as speech / non-speech, train a classifier (Gaussian mixture models, SVMs, ...) on the frame features.
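As a sketch with scikit-learn, assuming a `features` matrix like the one above and a hand-annotated 0/1 `labels` array (both hypothetical names), training an SVM could look like:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# `features`: (n_frames, n_features) array from the extraction step.
# `labels`: hand-annotated array, 1 where the frame contains speech.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(features, labels)

# Per-frame decisions on new, unlabeled audio.
frame_decisions = clf.predict(unlabeled_features)
```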
This lets you classify unlabeled frames into speech/non-speech categories. The last step is smoothing the decisions (a frame classified as non-speech but surrounded by hundreds of speech frames is likely a classification error), for example with an HMM, or simply a median filter.
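The median-filter variant is nearly a one-liner with SciPy, continuing from the sketch above (the kernel size, here roughly 0.5 s of 10 ms frames, is an assumption to tune):

```python
from scipy.signal import medfilt

# Smooth the 0/1 frame decisions: an isolated non-speech frame inside a
# long run of speech frames gets overruled by its neighbours.
# kernel_size must be odd; 51 frames is about 0.5 s at a 10 ms hop.
smoothed = medfilt(frame_decisions.astype(float), kernel_size=51)
speech_mask = smoothed > 0.5
```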
A few references:
- Pinquier et al., "Robust speech/music classification in audio documents"
- El-Maleh et al., "Speech/music discrimination for multimedia system applications"
- Carey et al., "A comparison of features for speech/music discrimination"
Note that the features and classification techniques they describe are also relevant for the one-class problem of detecting speech (rather than discriminating speech from something else). In that case, you can use one-class modeling techniques such as a one-class SVM, or simply take the likelihood score from a GMM trained on speech data as a "speechiness" measure.
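For the GMM variant, a sketch with scikit-learn (the number of components and the decision threshold are assumptions you would tune on held-out data):

```python
from sklearn.mixture import GaussianMixture

# Fit a GMM on speech-only training frames (hypothetical array
# `speech_features` of shape (n_frames, n_features)).
gmm = GaussianMixture(n_components=16, covariance_type="diag")
gmm.fit(speech_features)

# Per-frame log-likelihood under the speech model acts as a
# "speechiness" score; threshold it to get a decision.
scores = gmm.score_samples(new_features)
is_speech = scores > threshold
```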
If, on the other hand, your problem is really discriminating speech vs something else (say music), you could also very well use unsupervised approaches focused on detecting the boundaries between acoustically homogeneous segments, instead of characterizing the content of the segments themselves.
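As an illustrative sketch (not from the cited papers), a simple novelty curve compares the average features just before and just after each frame; peaks in the curve suggest change points such as a speech/music boundary:

```python
import numpy as np

def novelty_curve(features, half_window=50):
    """Distance between the mean feature vectors of the windows
    before and after each frame; high values suggest a boundary."""
    n = len(features)
    scores = np.zeros(n)
    for t in range(half_window, n - half_window):
        before = features[t - half_window:t].mean(axis=0)
        after = features[t:t + half_window].mean(axis=0)
        scores[t] = np.linalg.norm(before - after)
    return scores
```

Picking local maxima of this curve above a threshold gives candidate segment boundaries.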